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PREFACE 


Tenth anniversaries are traditional occasions for reflecting on the past, evaluating 
trends in the current situation, and speculating on the shape of things to come. The last ten 
years have seen the growth and maturation of the International Conference on Parallel 
Processing from a two-day invitational meeting of RADC contractors (1972) at which seventeen 
papers on diverse aspects and applications of the Rome Air Development Center Associative 
Processor (RADCAP) were presented, to a truly international meeting covering all phases of 
parallel and distributed processing. There were 136 papers submitted to this conference, an 
all-time record, compared to 117 in 1980, 93 in 1979 and 1978, and over 80 in 1977 and 1976. 
The number of papers from abroad has also grown, being 34, 31, and 23 in 1981, 1980, and 1979 
respectively, from 16, 10, and 10 countries. As in previous years, the high quality of the 
papers submitted made final selection extremely difficult. We wish to thank the 214 referees, 
including 87 non-authors, for their indispensable aid in selecting the 66 finalists for the 14 
sessions of contributed papers. Each manuscript submitted was sent to three referees; only 
their prompt and conscientious help, despite their other obligations, made it possible to have 
the proceedings available on time. 


An innovation this year, replacing the traditional keynote speech of the ceremonial 
session the evening before regular sessions begin, is the panel of five invited speakers on the 
history of parallel processing. Professor Tse~yun Feng organized it, for which we are most 
grateful. He is acutely aware that much important history is inevitably hidden from those 
participating in it, and therefore requests help from attendees and others, in compiling as 
complete a record as possible. Now is the time to do it, before memories fade and pioneers 
pass on. 


Another innovation is the tutorial on parallel processing the day before the conference. 
Such tutorials have become increasingly popular adjuncts of many meetings, and this one was set 
yp in response to suggestions made by attendees of previous conferences. We hope it 
establishes a new tradition of excellence. 


A special issue of the IEEE Transactions on Computers on Parallel and Distributed 
Processing is planned for December 1982. Professors Ming T. Liu and Jerome Rothstein are the 
guest editors. Papers presented at this conference or modifications thereof will be considered 
for inclusion as will others submitted by respondents to this and other calls for papers 
appearing elsewhere. The closing date for submission of manuscripts is January 1, 1982. One - 
hundred pages have been allocated to the special issue; we hope it will be of permanent 
reference value. 


The growth of interest in parallel and distributed processing in the last decade has been 
explosive, and will doubtlessly continue unabated. This conference could easily have grown 
very large, with parallel sessions and many more papers. However, the attendees have voted, 
year after year, against departing from the traditions of no parallel sessions, emphasis on 
attendance by active workers in the field, and of holding the conference far from the competing 
attractions of a metropolitan or resort milieu. The opportunities for prolonged, intense, 
personal interactions with established and upcoming researchers were felt to outweigh 
disappointments like being put on a waiting list and not being able to attend because of the 
rarity of cancellations. All this can change in the future, but only if the attendees wish it 
to. 


We would like to thank Dean Donald D. Glower, College of Engineering, The Ohio State 
University for his constant encouragement, and Professor Tse-yun Feng for his sage advice and 
counsel about the endless details of managing this enterprise. The assistance of Professor 
Chuan-lin Wu is also appreciated. Last, but not least, we appreciate the devoted assistance of 
Jy-jine Lin in computerizing so much of the routine involved. 


Ming T. Liu and Jerome Rothstein 
Technical Program Co-Chairmen 
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PARALLELISM IN COMPUTING 


Sidney Fernbach 
Control Data Corporation 
Livermore, California 94550 


Abstract. The parallelism in computers is 
reviewed from the early systems, such as the 
Univac I to the present day systems. Some degree 
of parallelism has always existed, sometimes for 
reliability, at other times for improved perfor- 
mance. With the highly reliable components 
currently in production, the main reason for 
today's parallelism is to obtain as high a 
performance as possible for the dollar. 


There has always been a degree of parallel- 
ism in digital computers as we know them today. 
At first it was for reliability purposes, later 
to achieve greater performance as well as for 
reliability. 


The first commercially available computer was 
the Univac I designed initially for Census Bureau 
work. It was designed as a decimal machine, hav- 
ing 6 bits to represent alphanumeric characters. 
Because of the fact that it used mercury delay 
lines, the machine was highly serial, sending 
bits down the delay lines one by one. On the 
other hand, there was more duplication of curcuits 
in Univac I than in most machines built since. 
Checking was provided by automatic comparison 
of results coming out of duplicate arithmetic 
circuits. This of course was done for reliabil- 
ity purposes, there being 5600 vacuum tubes in the 
system. Incidentally this structure was also true 
of the BINAC which was conceived earlier than the 
Univac. 


Other computers of the same vintage, (late 
40's and early 50's) used either relays, drums 
or delay lines and were in the most part serial 
in nature. When the electrostatic tube came into 
use, soon thereafter, most machines were binary in 
nature; fetching, storing and operating on words 
in a parallel mode. The earliest of these seem to 
have been the Bureau of Standards SEAC and MIT 
Whirlwind. Others soon followed -- mostly the IAS 
family of computers as well as some of interna- 
tional flavor such as those built in Manchester, 
England. The commercial vendors quickly came out 
with their version; IBM with the 701 and ERA with 
the 1103. These were 36 bit binary computers. 
For the most part these machines were highly 
serial. 


It was recognized even in the early 50's that 
performance could be gained through more parallel 
operations, but few designers or manufacturers 
thought it imsortant enough to go all out for 
performance. Early machines were pushed strongly 
by the Department of Defense for use in crypto- 
graphic work. Later the AEC, needing much high- 
er performance than that made available with the 
701/704 or 1103/1103A started to stir the pot with 
specially built systems incorporating parallel 
design. One of the first of these was the LARC. 

A version of this was specified by the Lawrence 
Livermore National Laboratory in early 1955. 
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It called for a number of processors sharing a 
common memory. The initial specs were far more 
demanding then those that ended up in the machine 
that was finally built. They required both 
binary and decimal arithmetic units, for example. 
The final version was an all decimal machine 
allowing for two CPU's and an I/O processor 

to function concurrently. Unfortunately there 
never was enough money in the budget to acquire 

a 2 CPU system, although hardware allowances 

were made for the addition of a second unit. The 
two LARC's eventually built and delivered (in 
1960) had but one CPU and the one I/O processor. 
The memory of the system could have up to 39 
independently addressable parts each of 2500 
words for a total of 97,500 words. The delivered 
systems had only 30,000 words. Input-output was 
taken care of by the issuance of summary commands 
to the processor unit. The CPU's alerted the I/0 
Processor to their presence and also checked for 
completion. Memory overlapping allowed for one 
instruction to be executed while the operand 
address was being transferred to/from memory 

and the operand address of another instruction 
was being indexed. The memory bus was time slot- 
ted so that systems had access to one or more of 
the 8 time slots of 0.5 sec. each. 


The main back-up memory in this system con- 
sisted of up to 24 magnetic drums which allowed 
for 3 read and 2 write operations to take place 
concurrently. 


The IBM 7030 or STRETCH was designed and 
built at the same time as LARC. It also incor- 
porates a great deal of parallelism. The most 
interesting is the look-ahead feature. While 
one instruction is being executed several more 
may be fetched and interpreted. Unfortunately 
branching, if it occurs, forces the look-ahead 
to undo what it had already done. 


Another interesting machine designed in the 
50's was the Gamma 60, designed by Compagnie 
des Machines Bull in Paris. This system con- 
sisted of a variable number of independent 
and different processors, sharing common two-way 
distribution busses. The processors did not have 
to be identical; as a matter of fact there were 
four different types. The Central Control Unit 
had 2 major subunits, the Transfer Distributor 
(TD) and the Program Distributor (PD). Priority 
decisions were made in this unit, data transfer 
requests being handled by the TD and instruction 
requests by the PD. 


Another interesting machine of the same 
vintage was the RW-400 or Polymorphic Data 
System. This was built by Ramo-Wooldrich 
Computers. It used a large cross bar switch 
to interconnect computer Modules/Buffer 
Modules with peripheral device modules. One 
of the computer modules acts as master and the 


others as slaves. Any data from peripherals may be 
requested, stored in a buffer module until needed 
then moved directly to the requested computer 
module. 


The National Bureau of Standards also built 
a multiprocessor called PILOT. It had 3 process- 
ors, each different from the others. One process- 
or was the arithmetic unit, another the house- 
keeping unit and the third the I/0 processor. 


The LARC and STRETCH served the scientific 
world well for a number of years, despite the 
fact that each had its problems and was delivered 
late (1960-1961). By that time the transistor 
generation was upon us and numerous highly capable 
machines were on the market. None of them matched 
the LARC and STRETCH in performance, but their 
levels were gradually being reached. Some of 
the features in these two systems crept into 
others. 


During this same period of time there were 
non-general purpose commercial machines that were 
also being built with parallel features, but I 
am not going to discuss these here. For example, 
FAA had a unique requirement for utmost reliabil- 
ity and hence multiple systems usually were built 
for this agency. Also the seismic industry had 
great need for high performance "vector" type 
of operations performed on Array Processors 
attached to standard equipment. 


The first big jump in performance by way of 
concurrency after LARC/STRETCH was found in the 
CDC 6600. This machine had 10 functional units 
as well as 10 peripheral processors. Each 
peripheral processor had its own memory for 
programs and for buffer space. Each can interrupt 
the central processor and monitor the central pro- 
gram address. Each PP takes one minor cycle (100 
ns) or 1/10 of the major cycle as its slot to per- 
form one of its steps. | 


The functional units consist of 2 multiple, 
2 add, 1 divide, 1 shift, 1 branch, 1 Boolean, 
and 2 increment units which can be operating con- 
currently, each being initiated at the start of a 
minor cycle. 


The 7600 was a follow on the 6600 with high- 
er speed components. In organization it was. very 
Similar. The chief difference was in the memory 
organization, a high speed memory of 64K words 
was backed up by a large 512K word slower memory. 
Again parallelism came to the rescue. Eight word 
"swords" could be read out of large core with one 
instruction. There was also a high speed swap 
that enabled communication between the two memor- 
ies to permit operations at high speeds. The 7600 
was about 5 times the 6600 in performance. IBM | 
had less parallelism in its equivalent level mach- 
ines named 360/91(95) and 370/195. 


Even as these machines were being designed 
and built, there were other efforts to provide 
even greater parallelism. Dan Slotnick, from 
whom you will hear the historical background in 
more detail and with more accuracy, had designed a 


system he called SOLOMON. This was accomplished 
while he was at Westinghouse, although the ideas 
had been percolating in his mind while he was 
still at IBM. 


This system was to have 1000 processors, 
each with its own memory operating in unison 
at commands of a central instruction issuing 
unit. They worked in lock-step fashion, such 
that, when an add instruction came along, each 
did its add using operands in its own memory, 
concurrently with the others. Thus a factor 
of 1000 could be achieved in performance over 
a single processor (if all could be in operation 
Simultaneously). There was a lockout feature, 
so that if 1000 pairs of operands were not 
available, some processing elements would remain 
idle. To handle certain types of mathematical 
problems more effectively, each processor was 
enabled to communicate with its 4 nearest 
neighbors. 


The Lawrence Livermore Laboratory initiated 
attempts to have DOE (then AEC) order a system 
from Westinghouse. Unfortunately, when the 
top management at Westinghouse learned how much 
money IBM and Univac were supposed to have lost 
on STRETCH and LARC, respectively, the corporation 
got cold feet and backed out. LLL, being still 
interested in the concept of parallel processors 
tried to find other manufacturers to build a 
system. IBM showed some interest and had one 
of its excellent architects, Jim Pomerene design 
a SOLOMON-like machine. It incorporated all the 
latest technology IBM had come up with for the 
360/90 system. Instead of 1000 processors, 
only 32 were proposed, but these were each very 
powerful units in their own right. This PNDC, 
as it was called also never saw the light of day. 
Other computer researchers and designers at IBM 
decided that pipelined structures were better 
than a parallel network of processors. They con- 
vinced the IBM management to give up PNDC. For 
a time this seemed like the end of the road. DOE 
(AEC), NSF, and ARPA representatives met to dis- 
cuss the situation and to decide whether it might 
be possible to join forces in having a SOLOMON- 
like machine built. Before either AEC or NSF 
could collect its resources, ARPA, with Ivan 
Sutherland in the lead was off and running. 

John Foster, who was head of D.D.R. and E. in- 
vited a group of "experts'' in to decide on 
whether or not to fund such a computer. The 
decision was "go". The resulting machine was 
ILLIAC IV, originally intended for Dan Slotnick's 
lab at the University of Illinois, but installed 
upon completion at NASA/Ames in California in- 
stead. The initial intent in this system was to 
have 256 processors, each with 2000 words of 
memory. Because of rising costs, the number of 
P.E.'s was reduced to 64. 


One interesting feature of the ILLIAC IV. 
which does not usually get much attention is the 
high performance disk associated with it. Early 
in the actual operation of ILLIAC IV, it was found 
that the 2000 word memory was too small, so the 
disk-file subsystem actually was made the main, 
memory. There were dual files, each with 5x10 


bits of storage capacity and each being to sustain 
a data flow rate of 500 megabits/sec. Since the 
data path to the array was 1 billion bits wide, | 
it was possible with proper synchronization to 
obtain a very high band width interchange with 

the processor memories. Used this way, the 

Illiac IV, for certain problems demonstrated 
performance not yet matched by more modern 
computers. 


When Illiac IV was contracted for and got 
under way, it was considered an experimental ma- 
chine. The intent was to learn to use such a 
system for solving large scale problems which kept 
growing in size. Since there was the chance for 
failure and since other concepts like "pipelining" 
were being proposed, LLL with the consent of AEC 
decided to try the alternate route of the pipe- 
lined machine. Again this was to be experimental. 
A contract was negotiated between LLL and Control 
Data Corporation which resulted in the STAR-100 
computer system. Simultaneously, Texas Instru- 
ments which was involved in the early work on 
Illiac IV became interested in building a 
"pipelined" machine. With internal customers 
initially, T.I., went ahead with the project that 
resulted in the Advanced Scientific (or Seismic) 
Computer (ASC). Both of these machines had 
multiple pipe capabilities; the STAR relied 
on external processors to handle I/0, the ASC had 
its own peripheral processor. Performance on 
these machines for highly vectorized problems 
was very good. Scalar capabilities were very 
poor. 


Overall impressions left with the computing 
community concerning the vector computing systems 
of the late 60's - early 70's were bad. Only 
one Illiac IV, 4 STAR - 100's, and 7 - T.I. - 
ASC's were delivered. It wasn't until the late 
70's that faith was restored in high performance 
machines. Seymour Cray, now of Cray Research, 
Inc., was able to build a high performance scalar 
system thoroughly integrated with a vector system 
into a beautiful package. Now, scalar problems 
could run faster than on any other system, and if 
any degree of vectorization was possible, the 
additional parallelism improved performance 
substantially. Each, the scalar and vector 
processor had functional parallelism as well. 
Chaining of vector operations was also possible. 


The realization of the need for scalar 
processing did not go unnoted by CDC. A new 
machine was designed to replace STAR-100. This 
was done in two steps. The first, resulting 
in Cyber 203 added a scalar unit to the two-pipe 
vector system and at the same time replaced the 
original core memory with a speeded-up semi-con- 
ductor memory. The second step resulting in the 
Cyber 205, replaced the vector unit with a faster 
LSI unit, now with up to 4 identical pipes. 


Not to be outdone by the others, Burroughs 
Corporation, who had built the Illiac IV now 
decided to build a much superior version, named 
the Burroughs Scientific Processor (BSP). This. 
machine was designed to have 16 processors. A new 
algorithm was employed in this system to permit 


access to memory with no conflict. This time the 
memory was accessible to all processors. Two 
levels of memory were employed; 524 K words of 
parallel processor memory and 4 M words of file 
memory. There were in addition to the central 
processor, an I/O processor and maintenance 
processor also. It was even possible to have two 
BSP systems tied together with a system manager. 
Actually the system manager was the front-end 
standard B7800 computer system. In this pro- 
cessing system there was no real scalar processor; 
one had to rely on the frontend. In this descrip- 
tion I have used the past tense, because as of 
this time the BSP has been abandoned as a 

product. 


Burroughs is not quite out of the large 
scale parallel processor design completely, 
as yet. There is an on-going effort to design 
a l Gigaflop machine for NASA/Ames to carry 
out Navier-Stokes calculations. This machine 
as described in earlier papers has 512 processors 
working concurrently. This may change, of course 
in this final year of preliminary design. Bur- 
roughs is competing with Control Data for this 
NASA contract; the Control Data design is more 
along the lines of the Cyber 200 series of 
machines. Burroughs has had much more experience 
in parallel systems. Besides the above mentioned 
systems, the Corporation built a PEPE prototype. 
This was a machine originally designed by the 
Bell Telephone Laboratory for use in Ballistic 
Missile Defense systems. Whether there will be 
a follow-cn to PEPE is hard to say at this time. 
This was also a multiprocessor. _ 


Other parallel processors have been designed 
and built primarily for special purposes such as 
image processing. One is currently being con- 
structed by Goodyear for NASA/Goddard. This is 
a follow-on to a bit oriented machine called 
the STARAN. ICL-in England also built a similar 
machine with 4096 processors. This one is 
called DAP - one is in operation at St. Mary's 
College in London. 


Because of the great strides made in 
microprocessor development, performancewise as 
well as costwise, there are any number of 
attempts to assemble numerous microprocessors 
in a multiprocessor system. As with most computer 
concepts, muitiprocessing is rather an old one. 
Some early versions have already been mentioned. 
Dual processors are commonplace, most manufac- 
turers having tried their hands at these at 
some time. 


The most ambitious attempts have been made 
by Carnegie-Mellon University, first with its 
C.MMP having 16 minicomputers tied together 
and later with its CM*, with 50 processors tied 
together in a number of modules. The hardware 
configurations are relatively easy to provide. 

The software provided the rub. Good system 
software and efficient algorithms for applications 
are much harder to devise. 


Other recent attempts to provide high 
performance systems are those of CDC, Denelcor, 


and the Lawrence Livermore Laboratory. CDC was | 
built and delivered and Advanced Flexible Process- 
or (AFP) made of the same LSI components used in 
Cyber 205. This system consists of 4 modules 
each with variable functional units structured 
in a ring type architecture. The initial system 
was to be used for a special application and did 
not need floating point. The Denelcor system, 
also at this point in time designated in a 4 _ 
processor configuration is being constructed | 
for the Aberdeen Proving Ground to be used in 
Ballistic Calculations. 


The Livermore system called S-l is being > 
sponsored by the U.S. Navy to be used for signal 
processing. In this case 16 memories are being 
tied to 16 processors by a cross-bar switch. 


Whether the parallelism being put into a 
multiprocessor is capable of being effectively 
utilized has yet to be demonstrated. Certainly 


the availability of a large number of processors 
at low cost implies that many can remain idle if 
the overall performance can be increased. It | 
seems plausible that the future designs should 
incorporate "pipelined" processing in multi- 
processing elements. 


As for the future, we seem not to be making 
as much headway as we should. Kung and his 
systolic approach, Dennis and Company and their 
Data-~Flow concepts seem to have much merit. It 
is too early to say that we will see such systems 
before the 90's -- but it seems unlikely. The 
more ingenious young people in their experiments 
with microprocessors no doubt will dream up 
better ways of designing parallelism into 
computers. Our main hope, however, is that 
the problem designers and software experts will 
help make it possible to take advantage of all 
these concepts in the not too distant future. 


HISTORY OF PARALLEL PROCESSING AT GOODYEAR AEROSPACE 


W. C. Meilander 
Digital Systems Marketing 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Abstract 


Associative memories have been talked about 
in scientific circles for a long time. This paper de- 
scribes some of the first efforts to bring that talk 
into the realm of reality. At Goodyear Aerospace, 
we continue to develop techniques for fabricating 
and using associative memories. The techniques 
used in associative memories are presented, and 
the disadvantages of the methods used at any time 
are discussed. The associative memory (AM) logic- 
ally leads to the associative processor (AP). Most 
of the advantages of AM's are retained in AP's, and 
many new capabilities are added. In fact, the AP 
is becoming a very powerful tool in handling the 
highly dynamic data bases of air surveillance and 
command and control systems. 


The associative processing effort is augmented 
by the endeavors associated with the microcomputer 
array processor (MAP) and the massively parallel 
processor (MPP). The MAP and MPP broaden the 
capabilities of parallel processing into the fields of 
electronic warfare and image processing. 


Background 


Vannevar Bush made a strong case for asso- 
ciative processors in 1945(1): "There is a growing 
mountain of research. But there is increased evi- 
dence that we are being bogged down today as 
specialization extends. The investigator is stag- 
gered by the findings and conclusions of thousands 
of other workers - many of which he cannot find 
time to grasp, much less to remember - as they 
appear. Yet specialization becomes increasingly 
necessary for progress, and the effort to bridge 
between disciplines is correspondingly superficial." 


Dr. Bush continues: "But there are signs of 
change as new and powerful instrumentalities come 
into use." He then discusses many of the discov- 
eries made in the past few centuries that have led 
to the increased activity of the 20th century. He 
cites the importance of communication in the scien- 
tific world with: "Mendel's concept of the laws of 
genetics was lost to the world for a generation be- 
cause his publication did not reach the few who 
were capable of grasping and extending it; and 
this sort of thing is undoubtedly being repeated 
all about us, as truly significant attainments become 
lost in the mass of the inconsequential." 


Dr. Bush considers the application of machines 
to "logical processes" with "formal logic used to be 
a keen instrument in the hands of the teacher in 
his trying of students' souls." He then describes 
approaches for selecting pertinent information from 
the mass of data available. His discussion of "memex 
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instead of index" states: "Our ineptitude in get- 
ting at the record is largely caused by the artifi- 
ciality of systems of indexing. When data of any 
sort are placed in storage, they are filed alphabet- 
ically or numerically, and information is found 
(when it is) by tracing it down from subclass to 
subclass. It can be in only one place, unless dup- 
licates are used; one has to have rules as to which 
path will locate it, and the rules are cumbersome. 


- Having found one item, moreover, one has to emerge 


from the system and re-enter on a new path. 


"The human mind does not work that way. It 
operates by association. With one item in its grasp, 
it snaps instantly to the next that is suggested by 
the association of thoughts, in accordance with 
some intricate web of trails carried by the cells of 
the brain. It has other characteristics, of course; 
trails that are not frequently followed are prone to 
fade, items are not fully permanent, memory is 
transitory. Yet the speed of action, the intricacy 
of trails, the detail of mental pictures, is awe in- 
spiring beyond all else in nature. 


"Man cannot hope to fully duplicate this mental 
process artificially, but he certainly ought to be 
able to learn from it." 


Early Effort 


Many have agreed with Dr. Bush. Activity 
has continued since 1945 to develop the concepts 
espoused by him. These efforts, at best, only 
scratch the surface of the thoughts in Dr. Bush's 
paper. Yet, they appear to offer relief from some 
of the laborious indexing tasks that are bogging 
down our present endeavors to retrieve relevant | 
data from an ever-increasing data base. An asso- 
ciative memory offers a system that allows retrieval 
of related data from a memory based on the data 
content in the memory. This can be understood 
when it's realized that an associative memory can 
directly implement a relational data base. In this 
context, the machine appears to emulate capabilities 
of the human mind. 


Slade and McMahon ribs Gano a cryotron cata- 
log memory system in 1956\“’. This paper is gen- 
erally accepted as the earliest record of a hardware 
approach to the problem of searching memory by 
content instead of address. The Western Reserve. 
University (WRU) search selector, discussed below, 
may be an earlier effort. 


In 1958, Goodyear Aerospace - while working 
with the concepts of associative memories - held a 
number of discussions with Dr. Jim Perry and Dr. 
Allen Kent. Perry and Kent were working with 
the techniques of information retrieval at the School 


of Library Sciences at Western Reserve University. 
Their work covered one of the earliest associative 
processors fabricated. The concepts of their ap- 
proach were presented in 1955. The machine de- 
veloped was called the WRU search selector. It 
was designed to search a document data base. 


The search selector(3) (Figure 1) was design- 
ed and built by Perry in 1956. It was a relay ma- 
chine and used a Flexowriter tape reader to input 
the data base to be searched. 


Fig. 1 - Western Reserve's Search Selector. 


The data base was formed from information ab- 
stracted from documents by knowledgeable review- 
ers. Keywords of the abstract were encoded (often 
by the same reviewers) via a dictionary into four 
character groups. The encoded information along 
with the document accession number was stored on 
punched paper tape. 


Queries were encoded using the same diction- 
ary. The queries were stored in the search selec- 
tor. Ten independent queries could be entered at 
one time. The system provided for queries using 
logical AND, OR, and EXCLUSIVE-OR operators 
and combinations of these operators. The search 
selector program was entered through the patch 
cord system shown in Figure 1. After the query 
was programmed, the punched paper tape data file 
was passed through the system. Whenever a query 
was satisfied, the document accession number was 
read from the tape and typed along with the num- 
ber of the query. The machine was used for sev- 
eral years in searching a file of documents for mem- 
bers of the American Society of Metals. A General 
Electric 225 computer replaced the WRU search 
selector about 1960. 


Why Associative Memories? 


-The concept of associative memories derived 
from many different requirements. In the WRU 


machine, it was used for evaluating a coded re- 
quest against a file of coded documents. In many 
other cases, the requirement, similarly, stemmed 
from the desire to search unordered data. At 
Goodyear, efforts were underway to find a method 
for locating items in memory on the basis of mem- 
ory contents. This early activity was prompted by 
a desire to examine the present position of a large 
number of simulated targets being updated through 
a digital differential analyzer.. The goal was to lo- 
cate and display each target at the proper time as 
a simulated antenna scanned the space. The store 
would be searched in the azimuth field for the cur- 
rent azimuth and the associated target range read 
for display. The search needed to be completed in 
a few microseconds. We wanted a faster approach 
than software could provide. 


Software Approaches 


Software associative searches were performed 
in sequential machines when the amount of data 
stored was small. Breakthroughs in list processing 
were achieved when such techniques as hash cod- 
ing, chained lists, and inverted files were imple- 
mented. These techniques eliminated the need for 
laborious searches of unorganized data (unless the 
field you were searching was not a key field). 
They also generated the complex file structures in 
use today, with their attendant complex update 
problems. The user does not realize the extent of 
the management software, since these complex file 
management structures are often a part of today's 
operating system. 


Hardware Approaches 


The desire to break away from the limitations 
of the sequential processor prompted much effort in 
the early 1960's. At that time, hardware techniques 
were advancing, and a variety of associative de- 
vices were suggested. Prominent among these were 
cryotrons, tunnel diodes, magnetic cores, magnetic 
films, and multiaperture magnetic devices. 


‘In 1959, Goodyear Aerospace began using 
multiaperture magnetic devices in associative mem- 
ories. Several problems existed in the application 
of magnetic devices to associative memories: 


1. A non-destructive method for evaluating 
the storage state must exist. (When magnetic cores 
are used for storage, a chosen word was destruc- 
tively read and rewritten. If one were to interro- 
gate an entire core memory, as is necessary in an 
associative memory, all data must be read and re- 
written simultaneously, which would be impractical.) 

2. A low signal-to-noise level exists when a 
magnetic device is non-destructively interrogated 
(pulses of short duration must be amplified and 
distinguished from noise). | 

3. High energy is required to change the de- 
vice state. This is true of all ferrite storage sys- 
tems. 7 oo 
4. Switching times of the storage elements are 
relatively slow compared with other devices such a 
the cryotron.: si | 


To evaluate multiaperture magnetic devices 
for associative memories, transfluxors were pur- 
chased from RCA. Limitations of the transfluxor 
led to the development of a multiaperture logic 
element (MALE) (see Figure 2) and a model content 
addressable memory using the MALE. 


The MALE(4) provided for storage of data in 
a word direction. A simultaneous exact match 
search of all stored words in memory could be re- 
alized. The MALE could be interrogated non-de- 
structively and provided an EXCLUSIVE-OR oper- 
ation. Initially, a response store was set for each 
word. The interrogation was made and reset the 
searched word that did not match the.query word. 
The words that remained matched the query. In- 
terrogation time of the MALE was about five micro- 
seconds. Limit searches in the MALE proceeded on 
a bit serial basis (five microseconds per bit). 
Greater than or less than search used the EXCLU- 
SIVE-OR logic, at the stored bit level, to test the 
memory state for either greater than or less than 
the input argument. 
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Fig. 2 - MALE Flux-State Diagram 


Search Memory 


The MALE was used to implement an associa- 
tive memory for evaluation with the U. S. Navy's 
USQ-20. The search memory(5) (Figure 3) had 
256 words with 30 bits per word. A block diagram 
of the search memory is shown in Figure 4. The 
machine instructions included write, erase, exact 
match search, greater than search, less than search 
and a number of optional instructions, no response 
required, response required, mask, no mask, 
count responders, etc. The memory was delivered 
in 1963. 


Fig. 3 - Search Memory 
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Fig. 4 - Search Memory Block Diagram 


The MALE elements used in the NTDS search 
memory were difficult to fabricate; thus, a search 
was conducted for more readily available elements 
to implement the EXCLUSIVE-OR function. Asa 
result, it was found that a conventional toroidal 
core could be interrogated without destroying its 
state. 


BILOC 


A toroid can be non-destructively interrogated 
using cross field switching techniques (6,7). How- 
ever, the high cross field current, low signal level, 
and critical wire alignment mitigate against good 
performance. Apicella and Franks(8) discovered 
that applying a transverse bias field to the core 
reduces problems. 


The static bias field results in: 


1. A reduction of core switching time to about 
one-third of the unbiased switching time. 
_ @ An order of magnitude increase in the 
cross field non-destructive output voltage. 
3.. The ability to achieve a logical EXCLU- 
SIVE-OR function in the core. 


Thus, a storage/logic element is produced 
that (1) can store a state, (2) be non-destructively 
interrogated, and (3) can provide a match or no- 
match comparison between the stored state and the 
interrogation. That is, the Boolean expression 
AB + AB produces zero output, and the expression 
AB + AB produces aone output. A is the query 
state, and B is the stored state. 


This element, a biased logic core, was named 
BILOC. BILOC required very fast rise time pulses 
since the output voltage existed only during the 
pulse rise (or fall) time. Pulse rise times of the 
order of 20 nanoseconds and currents of about one 
ampere were used. The transverse bias field was 
of the order of 100 oersteds. 


RADC Associative Memory 


BILOC was used in implementing and deliver- 
ing an associative memory in 1966 to the Rome Air 
Development Center.4 The RADC associative mem- 
ory had 2048 words of storage. Each word was 48 
bits long. The associative memory was coupled via 
DMA to a CDC 1604B host computer. In operation, 
data to be searched was moved from the 1604B to 
the associative memory. The queries were then 
moved from the 1604B to the associative memory 
along with a response request. Results were trans- 
ferred from the associative memory to the host 
1604B. 


A comprehensive set of instructions provided 
for conventional read/write of the memory and a 
set of logical interrogations, which included: 


1. Input interrogand, equal, not equal, 
greater than, greater than or equal, less than, 
less than or equal, next higher value, and next 
lower value. | 

2. Find the maximum or minimum value. 

3. Resolve instructions such as read first/ 
next responder address or data, count responders, 
jump on no response (or its inverse). 

4. Write next available location or write at 
given address. 

5. The capability of concatenating searches 
to implement complex searches. 


The RADC associative memory brought out 
several facts. Among these were: 


1. The desirability of dropping the parallel 
search capability since only exact match searches 
could use this feature. 

2. The desirability of processing selected 
entries in memory. Since transferring them to the 
host required time, the associative memory could 
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operate at only 35 percent of its capability because 
of the necessity for input/output. 
3. The desirability of a wide band I/O path. 
4. The desirability for an internal program 
store to minimize I/O with the host machine. 


Associative Processing 


These facts led to a goal at GAC; namely, to 
achieve full parallel processing within the associa- 


tive memory. That is, make it a true associative 


processor. The associative processor would accept 
unprocessed data at its input and produce pro- 
cessed results at the output, thus greatly reducing 
the input/output requirements and making greater 
use of the machine's capability. 7 


However, the extremely high energy demands 
for the simultaneous write of 2048 cores needed to 
realize associative processing necessitated a search 
for a storage medium that was more easily alter- 
able. The search led to plated wire. Plated wire 
offered the features of relatively low interrogate 
and write currents and was easily fabricated in our 
laboratories. Goodyear Aerospace conducted plated 
wire R&D from 1965 until 1969. 


Plated Wire Associative Processor 


In 1969, Goodyear Aerospace examined an air 
interceptor processing task() and demonstrated a 
plated wire associative processor (10, 11,12,13) ‘ 
The machine used a bit. slice-oriented organization 
(Figure 5). The bit slices could be interrogated 
at a 100-nanosecond rate. Input was either to a 
bit slice or to any 16-bit word location in the ar- 
ray. Output from a selected word in the array was 
bit serial. In addition to conventional read and 
write operations, the array performed a large set 
of search, logic, and arithmetic operations at high 
speeds. 
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Fig. 5 ~ Bit Slice-Oriented Organization 


Search operations are exact match with com- 
parand, mismatch with comparand, greater than 
comparand, less than comparand, between limiting 
comparands, search flag, maximum value, and 
minimum value. 


Logic operations are set response toggles, 
reset response toggles, complement toggles, shift 
response toggles, write flag from response, and 
write common to selected words. 


Arithmetic operations are add common argu- 
ment, subtract common argument, add memory 
fields, subtract memory fields, multiply memory 
fields, divide memory fields, multiply by common 
argument, and divide by common argument. 


The Knoxville Experiment. The plated wire 


associative processor, under contract to Univac, 
was programmed and installed at the Knoxville, 
Tennessee, air traffic control terminal. Several 
firsts were realized for this FAA installation. They 
were: automatic track initiation and update on bea- 
con and primary radar reports, automatic turn de- 
tection, Mode C altitude tracking, air-to-air con- 
flict prediction, conflict resolution, and automated 
voice advisory warning against other aircraft and 
terrain. 


The plated wire associative processor had 
several drawbacks. Among these were the lack of 
production wire for the memory, the small signal 
output from the wire, and the requirement for bit 
serial readout of data from the array. 


Early Integrated Circuit Efforts | 


About 1970, LSI content addressable memories 
(CAM) began to appear from companies that had 
integrated circuit capability. These CAM's offered 
advantages over plated wire such as low switching 
current, high speed, and parallel readout. But 
there were problems. 


A study by Shore and Polkinghorn (14) con- 
sidered a word-parallel, bit-parallel LSI associative 
processor. This cellular organization required 
about 130 transistors at each bit of storage and 
would provide parallel limit search and arithmetic 
operations (this is in contrast with the serial oper- 
ations Goodyear Aerospace had been using). Later, 
Shore concluded in a paper 15) entitled "Second 
Thoughts on Parallel Processing" that parallel com- 
puters would be extremely expensive and could 
never compete with the conventional processor. 

His results seem correct when based on the cellu- 
lar organization he had earlier studied. The gen- 
eral conclusions reached in Shore's "Second 
Thoughts" paper apparently assumed that all paral- 
lel processing hardware would require logic at the 
bit level. 


If processing hardware in a parallel machine 
is implemented within storage at the bit level, then 
system cost increases nearly linearly with the 
amount of storage as Shore's paper indicates. 
Further difficulties ensue because of the necessity 
to access each stored bit in both the word and bit 
direction. Figure 6 shows an organization for a 
content addressable memory. The cost of imple- 
mentation was. quite high. For example, wiring a 
CAM using a typical CAM chip of 64 bits required 
3b+2wt+8c connections, where b is the number of 
bits, w is the number of words, and c is the number 
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REGISTER REGISTER 


READ REGISTER 
Fig. 6 - Content Addressable Memory Organization 


of chips needed. Then, making an array of 256 
words by 256 bits would require 2304 leads. Good- 
year Aerospace concluded that this would be an 
unsatisfactory approach. A search for a method to 
use conventional memory devices in a bit or word 
mode was realized in Batcher's invention of the MDA 
memory and flip network‘16,17 These inven- 
tions had their genesis in Batcher's work on sorting 
networks 


Integrated Circuit Associative Processor 


The MDA memory and flip network allowed con- 
ventional memory chips to be written in a word mode 
and: read in a bit slice mode or vice versa. The in- 
ventions yielded an associative memory capability 
with only slightly greater amounts of hardware than 
a conventional memory. An associative processor 
was easily realized with a simple bit serial process- 
ing element (PE). One PE configuration is shown 
in Figure 7. The PE logic functions are given in 
Table 1. 
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Fig. 7 - Associative Array Block Diagram 
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Design of a solid-state associative processor ed by the generation of 250 simulated tracks based 
began in 1971. The first STARAN system (19) was on 250 four-leg flight plans entered into the machine. 
completed in April 1972 and was demonstrated at The ATC operations performed are listed below. 
the International Air Exposition "Transpo 72" at ee, : 
Dulles International Airport. A STARAN B system 


Radar input processing 
is shown in Figure 8. 


Primary 2D tracking 
Secondary 2D tracking 
Altitude tracking (Mode C) 
Flight plan update 

Target simulation 
Maneuver detection 
Conflict prediction 

. Conflict resolution 

10. Automatic voice advisory 
ll. Keyboard processing 

12. Full digital display processing 
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The system was set up and demonstrated with 
live radar in six locations in the United States and 
Canada. The demonstration could be speeded up, 
in simulation, by a factor of 30 times. This yielded 
effective performance as if: 


e 7500 flight plans were updated per 10-second 


scan 
Fig. 8 - STARAN B System e The new flight plan position generated 7500 
radar reports, which were used to update 7500 
3 | tracks 

The demonstration showed the capability of an e 30 displays were being driven by the system. 
associative processor to handle the air traffic control | 
(ATC) processing requirement. Figure 9 shows the In 1972, another R&D effort - a "real" rela- 
program flow in the STARAN S-500 programmed to tional data base - was implemented in STARAN. 
operate with up to 500 aircraft tracks. In this sys- The system used a parallel head disc and retrieved 
tem, digitized radar reports were received via data and entered data based on the content of the stor- 


link from an ARSR radar site. This was supplement-- ed data : : 
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Fig. 9 - STARAN S-500 Program Flow 


STARAN B installations were made at Rome 


Air Development Center, Defense Mapping Agency, 


Engineering Topographic Laboratories, Johnson 
Space Flight Center, and Goodyear Aerospace. 


MESSAGE 
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AP 


DISPLAY 
DRIVER 


(1) AWACS passive (24) and active(25) tracking, 


(2) data base 
processing‘\“'» 


Many applications were studied or were programmed. 
Some of them are for the space environment; catalog 
maintenance; detection and surveillance; weapons 


support; oo 
status (22,23), A number of suggestions about 
STARAN B were incorporated in the STARAN E: 


identification; and sensor systems 


management 
28,25). 


, and (3) image 


STARAN E 


STARAN B was followed by STARAN E in 1975. 


STARAN E provided improvement over the earlier 
STARAN B in several areas (Table 2). 


TABLE 2 - STARAN B/STARAN E COMPARISON 


Item 
Array page size 


Max storage/array 


Parallel I/O data rates 
Cycle steal 
Host interface 
Proc-to-memory bandwidth 
Processing rate (ops/sec) 
16-bit add 


16-bit search 


STARAN B 


256x256 

1 page 
0.008 Mbytes 
80 Mbytes/sec 
No 


Slow 


80 Mbytes /sec 


11.5x10° 


48. 0x10° 


LE 


STARAN E 
256x256 
64 pages 

0.5 Mbytes 

80 Mbytes /sec 
Yes 

Fast 

215 Mbytes /sec 


15.4x10° 


60. 6x10° 


Microcomputer Array Processor (MAP) 


Goodyear Aerospace's activities in electronic 
warfare led to a number of studies of parallel 
processing for the EW requirement. These efforts 
led to the MAP digital processing system designed 
for electronic warfare applications. This system 
is comprised: of two major subsystems: a preproces- 
sor that is a digital tracking device and a multi- 
processor that is a programmable computer sub- 
system. The preprocessor compares each digitally 
encoded radar pulse intercepted by the receiver 
system against a file of emitters being tracked by 
the preprocessor. Limit searches of frequency, 
pulse width, and angle of arrival as well as PRI — 
tracking are used in the association process be- 
tween intercept and emitter. The current feasibil- 
ity model of the preprocessor is a microprogram- 
mable device that can process in excess of 300,000 
intercepts per second from several hundred emit- 
ters in real time. Expansion to several million in- 
tercepts per second from a thousand emitters is 
possible. | 


The multiprocessor subsystem finds emitters 
among the intercepts that fail association in the 
preprocessor. This subsystem (Figure 10) con- 
sists of a number of independent processors that 
concurrently work on the emitter establishment 
problem. Each processor is a 32-bit programmable 
computer with its own dedicated memory and a 
capability to execute approximately four million 
instructions a second. In addition to the dedicated 
memory, each processor can communicate with 
numerous banks of global memory. The various 
global memory modules and their communication 
structure serve to tie the individual processors 
together in a symmetrical multiprocessor computer 
architecture. The multiprocessor system is modu- 
lar and can contain as few as two and as many as 
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associative processor. 


eight processors coupled with from 1 to 16 banks 
of global memory. A 32-million instructions per 
second execution rate is achieved. Expansions 
beyond these limits are possible if every proces- 
sor does not have to access every global memory 
module. A four-processor system (with three 
banks of global memory) was installed at Wright 
Patterson AFB in 1979 for use by the Air Force 
Avionics Laboratory. This system can execute 
approximately 16 million instructions per second 
and support a memory access rate of 20 million 
words per second. | 


Airborne Surveillance 


The capabilities of associative processing led 
to a study of its potential in the airborne surveil- 
lance environment. The study showed that the 
associative processor could augment a conventional 
airborne processor. Many of the inherently paral- 
lel functions such as report correlation, tracking, 
and display processing could be performed in the 
3 Processing throughput 
could be increased by more than an order of mag- 
nitude. 


A second study demonstrated the expected 
benefits. This was accomplished by interfacing 
a STARAN E to the host computer and by pro- 
gramming the machine to carry out many surveil- 
lance functions. A parallel effort was conducted 
to develop a processing element chip. The chip 
development was necessary to realize the associa- 
tive processor capability within the very limited 
space (less than 0.5 cu ft) and power (less than 
320 watts). The chip, using CMOS/SOS technol- 
ogy, was successfully fabricated by Rockwell 
International and demonstrated 11 months after 
the development was started. 


Goodyear Aerospace is currently under con- 
tract from Grumman Aerospace to design and build 
a number of prototype ASPRO units. 


ASPRO Or ganization 


A block diagram of ASPRO(21) is shown in 
Figure 11. ASPRO is divided into five function- 
al subsystems: : 


1. Control memory contains both program 
and buffer memory and is also connected to a host 
computer to allow for data, control word, and 
status transfer. 

2. Program execution control is responsible 
for maintaining the correct program flow by execu- 
ting program branches and returns as required 
and establishes correct timing and interlocking of 
the operation to be performed as defined by the 
instructions. ) | | 

3. Data path contains the working registers 
and an arithmetic unit. All data to and from the 
control memory and/or the array is passed through 
this portion. 

4. Array control identifies the array oper- 
ation to be performed and supplies correctly syn- 


chronized control signals to the array. 
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Fig. ll - Block Diagram of ASPRO 


5. Array unit is made up of 17 array mod- 
ules. Sixteen modules of 128 words each make up 
the 2048-word array. The spare module may be 
switched in should one of the basic modules be 
found in error. Each module includes a 128-word 
by 4096-bit array of solid-state multidimensional 
access (MDA) storage and 128 processing elements 
(PE's). 


The array consists of four basic components: 
array memory, flip or permutation network, pro- 
cessing elements, and resolver. Access can be 
made in either the bit or word direction, depend- 
ing on the mode bit of the instruction. 


Massively Parallel Processor (MPP) 


In December 1979, NASA Goddard awarded a 
contract to Goodyear Aerospace to construct a 
massively parallel processor (MPP) to be delivered 
in the fourth quarter of 1982. The MPP was de- 
veloped for image processing satellite data. The 
expected workload is between 10? and 1010 opera- 
tions per second. | 


The major components of MPP are shown in 
Figure 12. The array unit (ARU) processes ar- 
rays of data at high speed and is controlled by the 
array control unit (ACU), which also performs 
scalar arithmetic. The program and data manage- 
ment unit (PDMU) controls the overall flow of data 
and programs through the system and handles 
certain ancillary tasks such as program develop- 
ment and diagnostics. Three staging memories 
buffer and reorder data between the ARU, PDMU, 
and external (host) computer. 
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Array Unit 


Logically, the array unit (ARU) contains 
16,384 processing elements (PE's) organized as a 
128 by 128 square. Physically, the ARU has an 
extra 128 by 4 rectangle of PE's that is used to re- 
configure the ARU when a PE fault is detected. 
The PE's are bit-serial processors for efficiently 
processing operands of any length. The ARU has 
a very high processing speed (Table 3). The 
bandwidth between PE's and memory is 1.6 x 1011 
bits per second. 


A study showed the desirability of making 
edge-connectivity a programmable function. The 
top bottom and right-left edges can either be con- 
nected or left open. A spiral mode connects the 
16,384 PE's together in one long linear array. 


| I/O for the array is up to 160 Mbytes per 
second and can be transferred through the ARU 
I/O ports. Processing is interrupted for 100 nano- 
seconds for each bit plane of 16,384 bits trans- 
ferred - less than one percent of the time. The 

96 boards of the ARU are packaged in one cabinet. 
Forced-air cooling is used. 


Array Control Unit 


Like the control units of other parallel pro- 
cessors, the array control unit (ACU) performs 


TABLE 3.- SPEED OF TYPICAL OPERATIONS 
Execution | 
Speed* 


6553 
4428 
430 | 


_— Addition of Arrays 


8-bit integers (9-bit sum) 
12-bit integers (13-bit sum) 
32-bit floating-point numbers 


Multiplication of Arrays 
(Element-by-Element) 


1861 


| &bit integers (16-bit product) 
12-bit integers (24-bit product) 910 
| 82-bit floating-point numbers 216 


Multiplication of Array by Scalar 


8-bit integers (16-bit product) 
12-bit integers (24-bit product) 
32-bit floating-point numbers 


*Million operations per second 


scalar arithmetic and controls the PE's. It has 
three sections that operate in parallel: PE control, 
I/O control, and main control. PE controls per- 
forms all array arithmetic of the application pro- 
gram. I/O control manages the flow of data in and 
out of the ARU. Main control performs all scalar 
arithmetic of the application program. This ar- 
rangement allows array arithmetic, scalar arith- 
metic, and input/output to be overlapped for mini- 
mum execution time. 


Program and Data Management Unit 


The program and data management unit 
(PDMU) controls the overall flow of programs and 
data in the system (Figure 12). Control is from an 
alphanumeric terminal. The PDMU is a minicom- 
puter (DEC PDP-11) with custom interfaces to the 
ACU memories and registers and to the staging 
memories. The operating system is DEC's RSX- 
11M real-time multiprogramming system. 


The PDMU also executes the MPP program- 
development software package. The package in- 
cludes a PE control assembler to develop array 
processing routines for PE control, a main assem- 
bler to develop application programs executing in 
main control, a linker to form load modules for the 


ACU, and a control and debug module that loads 
programs into the ACU, controls their execution, 
and facilitates debugging. This package is written 
in Fortran for easy movement to the host computer. 


Staging Memories 


The staging memories reside between the wide 
I/O ports of the ARU and the PDMU. They also 
have a port to an external (host) computer. Be- 
sides acting as buffers for ARU data being input 
and output, the memories reorder arrays of data. 


Arrays of data are transferred through the 
ARU ports in bit-séquential order. That is, the 
most (or least) significant bit of 16,384 elements 
followed by the next bit of 16,384 elements, follow- 
ed by the next bit of 16,384 elements, etc. Re- 
ordering is required to fit the normal order of sat- 
ellite imagery in the PDMU or the host. Thus the 
staging memories are given a reordering capabil- 


ity. 


The large multidimensional access staging 
memory uses 1280 dynamic RAM circuits for data 
storage and 384 RAM's for error-correcting-code 
(ECC) storage (a 6-bit ECC is added to each 20- 
bit word). Initially, the boards will be populated 
with 16K bit RAM's for a capacity of 2.5 Mbytes. 
The memory can be programmed to input and out- 
put imagery in a wide variety of formats. 


The Future 


Current efforts in ASPRO, MPP, and MAP 
will yield improved processing systems in those 
areas where parallelism can be effectively applied. 
The breadth of application seems quite wide. A 
number of users have effectively converted "clear- 
ly sequential processes" into parallel algorithms 
for parallel solution. 


We see smaller, more powerful parallel pro- 
cessors occupying less space and using less power 
being developed in the near future. We see the 
parallel processing technology as a most cost ef- 
fective tool for real-time command and control, 
and other data base management tasks. 


We haven't satisfied all the thoughts posed by 
Dr. Bush, but a first step is readily implemented 
in today's parallel processors. That step is the 
virtual elimination of the elaborate indexing struc- 
ture required in today's processing systems. 
We've reduced "our ineptitude in getting at the 
record .. . largely caused by the artificiality of 
systems of indexing." 
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Abstract 


The circumstances surrounding the conception 
and development of centrally-controlled array 
processors are described. The period of time 
involved is from 1953 to 1975. It brackets the 
Westinghouse SOLOMON systems, their precursors and 
the University of Illinois ILLIAC IV. Some 
reflections on past and possible future interplay 
between university and government laboratories on 
the one hand and industry on the other are made at 
the conclusion. 


The First Stirrings 


In June 1952, with a new bride and new 
Masters degree in Mathematics, I took a job as a 
programmer with the Electronic Computer Project at 
The Institute for Advanced Study in Princeton. I 
had no idea what a programmer was expected to do 
but a school friend, Adolph Nussbaum, who was 
already working there and had arranged my 
apparently successful interview with Herman (H.H.) 
Goldstine, assured me that it was interesting and 
honest work and I did need a job. I left 
Princeton in February 1954 to return to school for 
a Ph.D. and though my stay at the Institute had 
been for only twenty months it played a 
significant role in my development. First, it was 
my initial contact with what became and remains my 
profession and it was the place where my vision 
broadened from the myopia of a young, partially 
cooked mathematician to encompass my still- 
enduring interests in physical science and 
technology. In particular, I had the good fortune 
to learn the rudiments of logic and hardware 
design from members of one of the most capable 
engineering staffs ever assembled. I remember, 
particularly, Leon Harmon, Hugh (Hewitt D.) Crane 
and Julian Bigelow as tutors. It was also at 
Princeton. that I first thought of building a 
parallel processor. The idea was stimulated by 
the physical appearance of the magnetic drum that 
was being built to augment the 1,024 word primary 
memory of the IAS machine. The disposition of 
heads and amplifiers over the drum’s 80 tracks (2 


banks of 1,024,40-bit -words) suggested to me the 
notion of, . first, inverting the  bit/word 
relationship so that each track stored the 


successive bits of a single word (in fact, of 
several words) and, second, associating a ten tube 
serial adder with each track so that in a _ single 
drum revolution an operation could be executed on 
the contents of the entire drum. The idea was to. 
do, in parallel, an iterative step in a mesh 
calculation. I. remember, probably under 
influence of the 1,024 word Williams tube memory, 
desiring to build a 1,024 track drum to represent 


(a) *To appear in Annals of the History of Com- 
puting. 
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much at IBM and it was partly, 


Urbana, Illinois 61801 


in this fashion a 32 by 32 mesh. I even had the 
temerity to make it the subject of the only 
conversation I had with von Neumann that didn’t 
concern itself with one of the mundane programming 
chores that I occasionally did for him. He 
considered the idea for perhaps half a minute when 


he said that he thought it would require too many 
tubes. It was not a devestating personal setback 
because it had been the work of only a few 


evenings and some casual conversation with Harmon 
and Cr ane. I essentially considered final 
judgement to have been rendered and didn’t 
seriously take up the idea again for roughly five 
years. During those five years I completed a 
Ph.D. degree (in applied math at what is now 
called The Courant Institute in New York) spent a 
year on a _post-doc at Princeton (the University 
this time) and succumbing to the lure of action 
and regular meals took a job at IBM in 
Poughkeepsie. 


I don’t recall the immediate stimulus to 
taking up the idea again but know that it was 
early in 1958 as a member of the Computer 
Organization Department in the Research Labs. 
This Department, under W.J. Lawless and M. Clayton 
Andrews, was an environment that invited, in fact 
demanded, far ranging frontier exploration. The 
technological scene had by this time totally 
turned over; from receiving tubes and CRT _ stores 
to transistors and cores. My serial adder now 
became a small circuit board and the PE memory a 
core plane. I did some designs working on my own 
and began to think more generally about how one 


would need to modify algorithms to utilize 
parallelism. These thoughts and a few corridor 
conversations with my friend and colleague John 


Cocke resulted in my writing up [Cocke 1958]. 
This report describes, in some detail, including a 
derivation of the O(logyn) speed up where n, the 
degree of the polynomial, is assumed to be less 
than the number of-processors. It also suggests 
parallel algorithmic approaches to the solution of 
ordinary and partial differential equations. My 
efforts in parallel computation never amounted to 
but by no means 
exclusively, for this reason that I left IBM in 
June 1960 for what was to be aé_e strange but 


important interlude lasting only. 8 months at the 
newly formed Baltimore Aerospace Division of an 
old airplane manufacturer, Aeronca Manufacturing 
Corporation. What lured me to Aeronca was the 
promise that I could pursue my own ideas on 
parallel computers, the well-chronicled IBM 
frustration syndrome and a large raise in pay. 


It would be wrong to say that IBM played no 
in the development of my ideas but the role 
largely indirect. It was at IBM, while 
Rex Rice on the design of a small 


role 
was 
working for 


general-purpose computer, that I really learned 
the trade. My debt to Rex is great, though when I 
talked to him about parallel computers, although 
he listened, it was not always overly patiently.. 
He even witnessed some of my drawings 
then as always, to the great good fortune of the 
computer field, had his own fish to fry- and I got 
paid to help him fry his fish, not conversely. 
Rex will appear again, in an important way, much 
later in this story. Of the IBMers, Lawless alone 
showed some interest in my ideas and was. the 
first, in 1959, to alert me to the work of Konrad 
Zuse, which to my only mild embarassment I confess 
I have still never looked at, who described a 
drum-based equation solver (perhaps similar to the 
machine I had thought about in Princeton) which is 
described in [Zuse 1958]. Regretably, I am here 
engaging in the dubious scholarly practice of 
quoting a non-primary source, for the only work I 
have even seen that refers to this Zuse paper (it 
gives a 1 line description) is [Kuck 1978] and ny 
attempts to get the reference, it is in German, 
from the 4 libraries in the U.S. which allegedly 
possess it or from a colleague who is "looking for 
his copy'' have thus far been unsuccessful. 


Now for Aeronca and the almost-real world. 
people stand out from this interlude. One is 
Dr. Gordon J.F. MacDonald, who as a visiting 
scientist at the Goddard Institute for Space 
Studies, then in Silver Spring, Md., gave me both 
my first real encouragement and my first research 
contract; both were important. Moreover, Gordon 
understood critically and in precise detail what I 


Two 


was talking about. His support was thus 
particularly meaningful. With the help of this 
contract I did the first, fairly complete overall 


design and detailed an enriched Processing Element 
(PE) (I started calling them "elements" because 
large numbers of "elements" seemed somehow easier 
to take than large numbers of "units") to gate 
level. As a consequence of this contract, 
unforeseen to me, the government acquired a 
permanent license which permitted me to continue 
the work, under government sponsorship, after I 


left Aeronca. 

The second of the two people is Chuck 
(Charles N.) Valenti. Chuck was the salesman who 
was given the responsibility of trying to help 
this innocent Ph.D. find an agency that would 
recognize some sort of stake in my research. 
Chuck had, particularly for a salesman, a rare 
combination of attributes. He was a true believer 
in the capitalist system, a patriot, very smart 
and essentially honest. He believed, and 
convinced me, that if you couldn’t sell it then it 
probably wasn’t worth all that much. He also knew 
the DOD like the back of his hand (it was Chuck 
who first revealed to me, for example, that every 
fourth door in the Pentagon was the entrance to a 
men’s room). Trailing Chuck around and giving my 
pitch while he opened doors, watched, criticised 
and schemed gave me my first glimpse of how things 
worked in the complement of IBM, which I had 
previously considered barely non-empty. 


The reason I left Aeronca still strikes me as 


but Rex, 
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amusing. The Technical Director, to whom I 
reported, had singularly eclectic interests 
comprehending the importing of spaghetti making 
machines and marble, a housing development 
corporation, ESP and ae process’ developed by a 
retired colonel in Pennsylvania for ridding crops 
of all manner of blights by simply placing, for 
example, a leaf from a plant in the troubled field 
on top of a very special box of the venerable 


colonel’s design. I was assigned the task of 
investigating this phenomenon for possible 
commercial exploitation. I examined the circuit 


schematic and found that, among its many unusual 
features, it seemed to function without a _ power 
supply and no detectable closed circuit. I 
reported that I thought the whole business 
preposterous nonsense and took the opportunity to 
also express my dim view of ESP as a means of 
secure battlefield communication. Although I 
wasn’t fired on the spot the strain in the 
situation grew worse rapidly and I soon left, 
having arranged for a job with the nearby Air Arm 
Division of Westinghouse at Friendship Airport, 
Maryland (all the names of nearly everything have 
long since changed). It was at Westinghouse that 
things finally began to take off and I will 
discuss my four years there in some detail. 


SOLOMON 


I was hired by the Engineering Manager, Harry 
Be. Smith, a first-rate radar man and an excellent 
manager. We agreed that I would be based in an 
existing computer development group that had some 
good people in it and some substantial 
accomplishments to its credit- primarily in the 
area of airborne analog computers. I was, from 
the beginning, however, given the freedom and 
wherewithal to pursue my own ideas. I followed up 
some of the contacts Valenti and I had initiated 
and quickly secured support from the Rome Air 
Development Center (RADC) and the United States 
Army Signal Corps Laboratory at Fort Monmouth, New 
Jersey. The principals I dealt with were Al (Alan 
A.) Barnum, Morris (A.) Knapp, and Bill (William) 
Moore at Rome and Dave (David) Haratz, Milt (M.A.) 
Lipton and Dr. Ed (Edward) Reilley at Fort 
Monmouth. Within 3 or 4 months from the time I 
joined Westinghouse I had started a small group 
with 2 young engineers who were the first to work 


with me on developing the SOLOMON design. They 
were Carl (W.C.) Borck and Bob (Robert C.) 
McReynolds. Carl, Bob and I then spent a most 


productive and gratifying year working out the 
design and some programming details which we 
reported first in a Workshop on Computer 
Organization [Slotnick 1963] held at Westinghouse, 
under RADC and Westinghouse Sponsorship in October 
1962. Carl, Bob and I later presented a more 
detailed design article [Slotnick 1962] at the 
1962 Fall Joint Computer Conference which to. our 
surprise and pleasure won the first AFIPS Prize 


and which, together with a companion report on 
applications [Ball 1962], became the standard 
citations for SOLOMON. A word about the name 
SOLOMON before discussing the design; it was 


suggested to me because of both the (wise as) King 
Solomon connotation and his 1000 wives (servants, 


in a ruder and simpler era). SOLOMON was designed 


to have 1000 (OCTAL) Processing Elements (PE’s). 
I had no acronym in mind. Much later the tortured 
Simultaneous Operation Linked Ordinal MOdular 


Network was devised by a creative salesman, ~ Jerry 
McKindles with, I must shamefacedly confess, my 
help. The final design of SOLOMON (later, as we 
shall see, to be called SOLOMON I) was reported in 
[Gregory 1963]. By then I had a group of 12 
engineers, under John (J.G.) Gregory, who so 
creatively and energetically supervised the later 
Westinghouse design and development work. Some of 
the others who figured prominently were Bill 
(W.W.) Beydler, Art (A.B.) Carroll, Marv (M.G.) 
Graham, Ed (E.R.) Higgins, Jim (J.R.) Hudson, Bill 
(W.H.) Leonard, George Shapiro, Dave (D.K. ) Sloper 
and Bob (R.M.) Trepp. 


In discussing the design I will utilize 
figures from both [Slotnick 1962] and [Gregory 
1963]. The main ideas remained the same 


throughout the SOLOMON and, in fact, the ILLIAC IV 


program. These were of a PE array controlled from 
a central source, as shown in Figure 1. The 
program store was associated with the central 


control while operand storage was in the array as 
shown in the Processing Element block diagram of 
Figure 2. Each PE possessed a memory composed of 
two core frames. The two-address system employed, 

used a frame for each operand and wrote the result 
over one of them. Operands could = also be 
broadcast from the central memory or come from 
(and results go to) the 4 nearest neighbors of any 
PE. This was essentially the drum design of the 
Institute days with the drum tracks replaced by 
the core frames. The connectivity was the same 


but I added the ability to wrap around the extreme 


rows and/or columns under program control. Also 
new were the Geometric Control Registers, shown in 
Figure 3, which permitted the selection of rows, 
columns or row/column intersections by number. A 
special buffer, the L-buffer did matching between 
the conventionally or ganized word-oriented 
processor of the central control unit and the 
serial-by-bit PE’s. That is, words were handled 
in parallel (serial by word, or block of words) 
between L-buffer and central control. but serial by 
bit (all the kth bits of a column of PE words) 
between L-Buffer and array. The core frames were 
assembled into stacks, as shown in Figure 4, which 
shared address decoding and drivers. Thus, the 
Same address was selected for all the frames in a 
stack. In ILLIAC IV the technology was by then, 
as we'll see, able to permit independent PE memory 
addresses. ; 


What Carl Borck named Mode Logic was to 
remain the main "local" (applicable to a PE, as 
opposed to "global"-applicable to the whole array 
via central control- terms 
mathematics because they fit so well) 
capability in all subsequent array 


logical 
computers. 


Each PE had a 2 bit register in which a local 
data-dependent "mode value" could be set. Each 
instruction had a corresponding 4 bit field 


specifying any subset of the 4 possible mode 
states which were "allowed" to participate in that 
instruction. That is, only PE’s in one of the 
mode states specified in the instruction could 
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PE’s). 


The 
resulted from these efforts. 


this 


change states during that instruction cycle. PE’s 
in an "off" mode state could still provide 
operands to a neighbor if called upon to do so, 
and thus could influence the state of an "on" PE 
but an off PE could not itself change state. This 
very simple, almost irreducible residue of local 


control usually functions adequately as the 
principle array conditional branch. Word length 
was variable under the control of a _ setable 
register. This was easy in a bit serial system 


and necessary during this era of expensive memory. 


The PE’s were grouped into subarrays of 32x8. 
A system could contain up to 8 subarrays (2,048. 
did many design studies with 
independent subarray controllers as we did with 
numer ous redundant (for reliability) 
configurations of PE rows, columns and subarrays. 
optimistic 4 quadrant ILLIAC IV design 


We 


We never built a full-scale SOLOMON but did 
build significant experimental models; a 3 by 3 
model with complete PE’s, a 10 by 10 model with a 


somewhat different PE and numerous’ special 
breadboards to measure electrical characteristics 
in order to optimize signal distribution within 


the array and between the array, the central 
control and 1/0 units. 
During 1962 and 1963 the PE evolved, with 


continued integrated circuit evolution, from the 
simple bit-serial element to a 24 bit element with 
24 bit registers and byte-oriented PE arithmetic 
hardware. We built experimental PE’s which added 
24 bit numbers in 3.4 ysecs and multiplied them in 
less than 20 ysecs (using 10 mh circuits). Even 
programmed floating point times became reasonably 
respectable. The basic PE system module became 
smaller (256 to 32 PE’s) as the PE itself became 
larger (1 to 24 bits). In ILLIAC IV this was to 
become modules of 64, 64 bit floating point PE’s. 


For reasons soon to be clear the details of these 
later SOLOMONS (by now called by us at 
Westinghouse, SOLOMON II) were never published. 
They can be found, however, in [Westinghouse 
1964a] and [Westinghouse 1964b]. 

Problem analysis and. programming. tried 
breathtakingly to keep pace over this thirty 


months or so of evolution of designs and models 
and. simulations but except for some particular. 
problems in partial differential equations, matrix 
inversion and a significant group in signal 
processing, never really managed it. We convinced. 
ourselves early that the problem space for which 
computer organization excelled was large 
enough to justify development of the full-scale 
machine, and, indeed, it nearly did. Tn 
programming we designed some new language 
constructs to overlay on a standard higher-level 
language, such as DO TOGETHER, but these likewise 
were to remain incompletely implemented. 


At the beginning of 1964 RADC was pushing 
hard and with apparent success for a DOD program 
to build a full-scale SOLOMON but in March the 
situation turned totally sour. Our principal DOD 
Washington sponsor drowned in a tragic accident 


and the program’s chief.‘ opponent, an old-line 
affiliate of an industrial competitor who was soon 
to leave DOD under’ something of a cloud, acted 
quickly to kill our chances. By this time I had 
assembled a group of 100 or so mouths to feed, 
about half of whom were working on SOLOMON (I had 
taken on additional responsibilities) and had to 
shift gears quickly to avert a total wipe-out. 


Thus began the visits to the Atomic Energy 
Commission’s Lawrence Radiation Labs at Livermore, 


California and Dr. Sid (Sidney) Fernbach. Sid 
labored mightily to stave off a disaster at 
Westinghouse. He wanted to see the technique 


tried on a decent scale but failing to squeeze the 
development cash out of Washington, he managed 
only a contract offer to lease a system were 
Westinghouse to develop it on its own funds and‘ I 
couldn’t argue Westinghouse into accepting it. 
The 6 months between the shut-off of most of our 
DOD funds and the AEC’s unacceptable contract 
offer remain a frenzied blur with Sid’s lasting 
encouragement and friendship as its only redeeming 
feature. 


When the Group Vice-President at Westinghouse 
turned the contract down I, of course, instantly 
submitted my resignation which he turned aside in 
a speech so full of understanding and concern that 
I managed not to kill him. The Westinghouse 
Baltimore executives acting with their typical 
very considerable skill had me dismantle my group 
and then offered me the job of manager of Advanced 
Development, a group of 1,000 engineers plus a big 
support operation, which included the offer of a 
home for most of the people in my SOLOMON group. 
I accepted it as a means to resettle my group. I 
also tried for several months after the debacle to 
start my own company and raise the venture capital 
to build a machine for Sid but I couldn’t quite 
pull it off although I had pledges for about three 
quarters of the money when I took another turn. . 


I had gotten to know John (R.) Pasta, head of 
the Computer Lab at the University of Illinois, 
when he invited me to join an informal group that 
met periodically to discuss directions in the 
computer field. He knew what was happening to me 
and encouraged me in many ways to reflect on the 
abiding nature of intellectual achievement as 
Opposed to the transient goals I was becoming 
obsessed with. The academic life looked good, 
indeed, and after checking the othet academic 
possibilities I decided to join John at Illinois 
because of their outstanding reputation in having 
not only used but built machines and because of my 
friendship and respect for John. In May of 1965 I 
moved my family to Urbana. . 


ILLIAG IV 


_ The SOLOMON work had essentially come to an 
end at Westinghouse a year before I left for 
Illinois and that year was spent frantically and 
unsuccessfully looking for ways to get it going 
again while simultaneously taking on a tough new 
job assignment. When I arrived in Illinois I was 
anxious to do something else, in fact, anything 
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whom I had visited at 
Information Processing 


else. Ivan Sutherland, 
ARPA, where he headed the 
Techniques Group, called and asked to visit me 
about a month or so after I came to Illinois. 
Without really expecting anything to come out of 
it other than a pleasant visit with a very bright 
and genial colleague, I of course happily set it 
up. As it turned out, what was on his mind was to 
see if I would be interested in. developing a big. 
parallel computer at Illinois. I thought about 
it, talked with John Pasta who was more than 
agreeable and, not without some "here we go again" 
trepidation, I told Ivan yes. 


Ivan wanted to start the project with a small 
study phase but I absolutely refused. I wanted a 
two million dollar payment at the outset and a 
contract for a total of ten million. I did this 
to make’ sure that the ARPA committment was real 
and had passed the highest levels of review. Ivan 
agreed. I wrote the proposal and a few weeks 
later we had our contract. John and I decided 
with much regret that the days when a_ university 
could design and fabricate a big machine, by 
itself, were over and we decided that while we 
would do the architectural design and most of the 


software and applications work, we would rely on 
industry for detail design and fabrication. 
For nearly a year I gathered a nuclear. staff 


and worked to develop design specifications for a 


study phase procurement to be followed by the 
fabrication phase. I outlined the major 
approaches in [Slotnick 1966] and we incorporated 
them in a bid set. In August 1966, after many 
months of intensive contacts with industry, three 
8-month contracts were awarded to RCA, Burroughs 
and UNIVAC. In April 1967 Burroughs was’ selected 
to go on to do the final design and development. 
The selection of Burroughs, while not quite by 


default, was hardly hotly-contested. Burroughs had 


teamed with Texas Instruments, who were to develop 
the integrated circuits for the PE. even though 
they were, at the same time, building their own 
high-performance pipelined system. They (TI) had, 


in fact, tried to interest us in abondoning the 
parallel approach in favor of a pipeline. Control 
Data also had a whack at this. But it seemed 


pointless, from any point of view and, in fact 
impossible from mine, to think of developing three 
pipeline processors and no parallel processors. 


The technical details of ILLIAC IV are quite 
well known; the standard citation is [Barnes 
1968]. I will thus concentrate here on_ those 
surrounding circumstances and issues which most 
influenced the program. 


My original intention was to build a _ system 
consisting of four subarrays (quadrants) of 64 
PE’s each; a PE now being a 12,000 gate floating 
point (48 bit mantissa-16 bit exponent) processor. 
The 4 quadrants could each operate independently; 
the system thus acting as a 4 unit multiprocessor. 
The system could also operate as 2-2quadrant units 
or as a single 4 quadrant (256 PE) unit. Each PE, 
moreover, could function as a single 64 bit 
floating point element as. 2 - 32 bit floating 


, 


‘point elements (24 bit mantissa- 8 bit exponent). 


or as 8 - 8 bit character-oriented fixed point 
elements; this last operating mode being directed 
toward signal and image processing applications. 
As we will see, by 1969 cost overruns made it 
necessary to reduce the size of the system to the 
single quadrant, described in ([Slotnick 1971], 
that was built. (It is a quaint observation that 
the space still exist in the ILLIAC IV back panels 


to plug in the connectors to the missing three 

quadrants) The PE, however, underwent no 
significant change in logical organization. | 

There were two main contributions to the 

early overruns and consequent retrenchment. The 

first was the inability of Texas Instruments’ to 


produce the 64 pin ECL packages around which we 
had designed the PE. A great deal of inconclusive 
mutual finger pointing went on between TI and 
their suppliers but the upshot was the loss of 
somewhat more than a year of time and, all related 
and consequent expenses considered, perhaps 4 
million dollars. The second severe setback was 
the inability of Burroughs to produce the magnetic 
thin film PE memories. In time and dollars this 
amounted to the loss of roughly an additional 2 
million dollars and a year further delay. | 


We retreated from the 
standard 16 pin dual 


64 pin packages to 
in line packages. In so 


doing, however, everything got bigger and more 
expensive. A lot of the logic design was 
salvageable as a consequence of making the 16 pin 
packages logically derivative of the abondoned 64 
pin packages but board layout, back-panel wiring 
and all system-level hardware had, of course, to 
restart from scratch. The memory situation was 
even messier. Burroughs had a large investment in 
their thin films and didn’t want to give up on 
them even after my own and independent review had 


concluded that they still represented an 
intolerable development risk. Semiconductor 
manufacturers were just beginning to gear up for 


memory chip manufacture; the manufacturing means 
were clearly at hand, or at least so it seemed to 


me, but the chips were not. I made the painful 
decision to drop films and go with semiconductors. 
After making the rounds of all the possible 


qualified suppliers I recommended the selection of 


Fairchild Semiconductors, whose memory systems 
operation was then headed by Rex Rice, under whom 
in a former life I had, as I have already 
discussed, learned the tricks of the trade. This 


selection met with considerable opposition from 
Burroughs and others. Such was the furor in fact, 
that ARPA, convened a panel of independent experts 
who carefully reviewed the situation and sustained 
my decision. It turned out that when Fairchild 
delivered their memory it was still the only 
high-speed semiconductor memory being delivered 
and that ILLIAC IV was the first large system to 
employ a semiconductor primary store. | 


By this time Ivan Sutherland had sought other 
if not greener pastures and the combination of 


Larry (Lawrence G.) Roberts and Bob (Robert) 
Taylor had replaced him in the ARPA computer 
operation. To Larry, without doubt, goes the 


distinction (however dubious) of having shed the 
second greatest amount of blood for ILLIAC IV. 
Larry had several set of interests in the machine. 


First there were the direct applications that had 
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been identified including numerical weather 
prediction, sonar, radar and seismic signal 
processing and the, by now, usual list of 


computations that array computers do efficiently. 
Larry also shared my sick attachment to really big 
pieces of hardware. But in ILLIAC IV Larry also 
had the interest of the father of computer 
networks. Particularly, as the program’s costs 


escalated from my initial 10 million dollar guess 


(the sometime alleged relationship to the Sonny 
Liston-Cascius Clay (now Muhammed Ali) fight gate 
receipts is neither totally true nor totally 
apocryphal) to the more than 30 million it ended 
up costing for one fourth of the initial system. 
To justify these escalated costs the ultimate 


availability of the machine to an ever larger 


‘community of users became mandatory and ILLIAC IV 


and the ARPANET became inseparably linked. 


In 1969 the strains of 
dollar per month project within an academic 
department were operating destructively. The 
relations between John Pasta and me and between 
ILLIAC IV and non-ILLIAC IV faculty generally had 


running a million 


degraded beyond the point of repair and the 
project was made a free-standing Center in the 
Graduate College of the university. It was a 


terrible mess, due mostly to circumstances but in 
no small part to me, and I regret to this day the 
human hardships that resulted and the deep human 
relationships that were destroyed. 


By the beginning of 1970 the hardware program 
had been marginally restabilized; the PE had been 
redone with the new lower level T.I. integrated 
circuits, PE memory chips were being delivered by 
Fairchild, boards were being produced at a decent 
rate, cabinets and cables existed and our internal 
departmental conflicts were coming to a head on a 


campus and in a country that was becoming unglued 
by the Vietnam War. In May of 1970 both the U.S. 
action in Cambodia and the Kent State shootings 


took place and my generally conservative position 
with regard to the war became untenable to me. I 
informed the university administration and ARPA 
that I thought it wrong for the ILLIAC IV to be 
installed and operated on the campus and that if 
it was I would play no part in it. The reaction 
by campus administrators was consistent with all 
my previous observations: They ranged from the 
proposal that the future location of the machine 
be decided by a binding student election, to 
delivering the machine to a profit company to be 
set up near campus and protected both by high 
walls and armed guards. In the presence of 
continuing demonstrations, frequently violent, by 
as many as 6 or 7 thousand students which were 


sharply focused at the project, these suggestions 


by my "superiors" only supported my conviction. 


Due mainly to the efforts of Dr. Hans Mark, 
then Director of Ames, ARPA decided to ship the 
machine to the NASA Ames Research Center in 
Moffett Field, California, announcing their 


decision on 29 January 1971. In April 1972 
Burroughs delivered the system to Ames. It was 
plagued by a variety of serious hardware problems. 
Some of the early circuit batches failed both at 
‘high rates and in modes troublesome to detect. 
Moreover, huge numbers of back-panel connections 
and of terminating resistors were equally bad. 
Although some successful runs were made as early 
as 1973, the machine wasn’t running reliably until 
1975. The story of ILLIAC at Ames is, however, 
not mine to tell. 


The story of the ILLIAC IV hardware is, 
however, only a part of the ILLIAC IV story and 
perhaps not the most important parts From :1965 
on, as a result of the ILLIAC IV program, first 
the Urbana campus and subsequently many other 
university, government and industrial laboratories 
have undertaken. the analysis of the relations 
between computer architecture, algorithms, both 
numerical and non-numerical, and their expression 
in programming languages. The problems posed so 
long ago in [Cocke 1958] have begun to receive the 
attention that I believe they merit. Moreover, as 
I will presently discuss, parallel computation (or 
pipelined computation with which it shares many of 
its benefits and burdens) will not be the sole 
beneficiary of this attention. My own opinion is 
that the greatest advances in the efficient use of 
new architectures will accrue from research in 
numerical algorithms and that the benefits yielded 
by new languages will continue to primarily 
benefit the sanity and efficiency of the 
programmer and be of only secondary concern to the 
programee. 


This orientation certainly influenced my 
priorities in running the ILLIAC IV _ program. 
While I think the ILLIAC IV language development 


work of Dave (David J.) Kuck and Duncan (H.) 
Lawrie has had lasting value and has influenced 


many other researchers, my main concern remains 
with the part of the problem solution process 
that, though cognizant of the machine logical 


structure, goes on before the programming starts. 
The research of Ahmed Sameh, whose distinguished 
career also started with ILLIAC IV, exemplifies 
this position. We have discussed these matters at 
some length in [Slotnick 1978]. 


It is time, in concluding this section, for 
the pleasurable business of acknowledging some of 
the ILLIAC IV principals. It is also a risky 
business because there are doubtless some I’1l 
forget and others I’d just like to forget. I’ve 
already mentioned Kuck, who launched the language 
and application programming efforts and the one 
man gang Duncan Lawrie who produced GLYPNIR, which 
for ten years was the only working ILLIAC IV 
higher-level language, and did it by himself as a 
lowly graduate assistant.. I’ve also mentioned 
Sameh who essentially initiated, and remains a 
principal contributor to, the field of parallel 
algorithms for calculations in linear algebra. 
Bob (Robert S.) Northcote made contributions in 
software as did a group of some of the brightest 
kids (many of whom now as middle-aged men remain 
among my friends and collaborators) I’ve ever 
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known, including Pete (Peter A.) Alsberg, Gary 
(G.) Grossman, Dave (David M.) Grothe, Nelson 
Machado, Jimmy (James M.) Madden and Jim (James 
E.) Stevens. : 

the 


On the hardware side there was constant 


support of the many-sided Art Carroll and Dave 
(David E.) McIntyre. Masao Kato’s energy and 
skill was an inspiration to all of us. For 


several of the early years Dick (Richard M.) Brown 
helped hold things together and there was always, 
at the center of the storm, Frieda Anderson. At 
Burroughs (E.) Gary Clarke, with whom I share 20 
years of richly varied memories, started things 
off and remained the Godfather, the only one I 
could always talk to was Vern (Vernon Z.) Smith 
who also knew the best jokes. Dick (Richard A.) 
Stokes and George (H.) Barnes did creative 
technical work. At T.I. Harvey (H.C.) Cragon and 
Joe Watson led my good list and (J.) Fred Bucy was 
on it. I also had another list. 


Reflections 


I have a bit more to say about parallel 
computers. First, one of the most interesting 
directions remains insufficiently explored. In 
[Slotnick 1970] I outlined the simple notion of 
associating with each track of a rotating store 
some minimal logic capacity, much like the SOLOMON 
I PE. The resulting system I called a logic per 
track disk in evident generalization of the head 
per track disk. Such a system could, like my drum 
of old, search or process the complete contents of 


a rotating store every revolution and thus 
function as a calculating, and/or associative 
store. Figure 5 shows the general idea. The PE 


logic which looks at the tracks (I used a pair, in 
strict analogy to the SOLOMON frames and 2-address 
scheme) is just a simple serial bit stream (pair) 
processor with mode and routing logic. Nowadays 
one would want to include a few characters of RAM 
in the PE instead of the revolvers that were’ then 
appropriate. One could also think of replacing 
the disk tracks altogether by CCD’s or bubbles. 
The optimal parameters of a modern serial memory 
hierarchy for such a system would depend strongly 


on the overall system size and application scope. 
My student Stu (Stewart A.) Schuster and his 
Toronto colleagues continued this line of 
investigation, which they described in a sequence 


starting with [Ozkarahan 1975], 
have thus far 


of publications 
but only some small scale models 
been built. 


SOLOMON-like machines have been built by ICL 
[Parkinson 1976] and others. Even now, the MPP 
[Fung 1977] currently being built is a larger (128 
by 128) machine of this same general class. It 
remains to be seen whether parallel machines with 
floating point PE’s are a dead end. I don’t, 
however, consider the question to be of the first 
importance for reasons I will presently get to. 

I believe it has value to reflect on _ the 
aetiology of some of the major problems that 
occurred during the ILLIAC IV development. First 
and foremost, trying to provide technical 


leadership as well as administrative direction to 
a program of this magnitude from a base in a 
traditional academic unit or from anywhere on any. 
university campus made all the sense of trying to 
build a battleship in a bathtub. We had neither 
the facilities nor personnel to manage from a 
distance and even if we did, our temperaments, as 
intellectually driven monomaniacs, demanded being 
in the middle of every significant decision and 
altogether too many insignificant ones. It was 
only at great personal cost that, several years 
into the program, I disciplined myself to the 


slightly lesser of the evils: more global 
management. 

We also took on too much. Some of the 
hattles were unnecessary. I obsessively wanted 


every bit of speed we could get from any source. 
My hindsight is clear, I should have used more 
comfortable technology; our role there was not 
indispensable. By sacrificing a factor of roughly 
3 in circuit speed it’s possible we could have 
built a more reliable multi-quadrant system in 
less time, for no more money and with a comparable 
overall performance level. This same concern for 
the last drop of performance hurt us as well in 
the secondary (parallel disk) and tertiary (laser) 
stores. But this is all looking backward which 
violates my nature. I would rather conclude by 
looking ahead, at new directions suggested by 
current technology, with the benefit of these 
experiences. 


Let me say first that it will probably be 
quite a while before even every Cub Scout Troop 


much less every household has its own design 
automation system with direct links to its 
companion, computer-controlled electron beam 
lithography VLSI fabrication system. In the 


interim such systems will remain the possessions 
of a relative handful of manufacturers. These 
manufacturers will use them to manufacture only 
those systems for which they believe there is a 
sizable market. This is counterpoised to the 
first twenty years of the computer era, when the 
relatively simply attainable state-of-the-art 
development capabilities were shared by industry 
with both university and government labs. The 
result is that technology which intrinsically has 
the capability to launch an unparalleled era of 
system experimentation shows little sign of 
fulfilling this potentiality. While no single 
university can afford the many millions of dollars 
required to create and operate a facility that 
would be capable of turning out strange and 
occasionally wonderful prototypes, it would not be 
too much for an appropriately backed consortium. 
The technology is nearly at hand to permit serious 
experimentation with dozens of new and promising 
computing structures. It is with this 
that I earlier said it didn’t really matter 
whether centrally-controlled arrays with floating 
point PE’s are or aren’t a dead end. There can, 
in the near future, be many 
systems developed to solve this or that particular 
class of large-scale computational problems; on a 
bad afternoon I can think of a dozen myself. The 
question is only whether the capability will 


- potential. 


in mind 


special-purpose — 
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different .in principle 


Processor" 


become a reality. We must ask ourselves would we 
have charcoal-broiled steaks today had Prometheus 
given fire only to the Chrysler Corporation. 


I don’t want to finish without an explicit 
statement of my view of the field’s evolutionary 
First, to think of supplanting the 
primary rolé of the conventionally organized 
(Babbage-von Neumann) computer is nonsense. It 
is, literally, an epoch-making concept. What can, 
however, take place is the evolution of large 
systems (and I, of course, have reference only to 
large systems) to comprehend entire families of 
special- purpose "peripheral devices" in a way not 
than the way they now 
comprehend their library of programs, 
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THE HISTORY OF PARALLEL PROCESSING 
AT BURROUGHS 


Richard Stokes and Robert Cantarella 
Federal and Special Systems Group 


Burroughs Corporation 
Paoli, PA 19301 


Introduction 


Parallel processing in the context of the 
Burroughs experience has been synonomous with 
the development of the “supercomputer”. While it 
is accurate to claim that, throughout the 
Burroughs standard product line, the application 
of parallel processing design is in ample evidence, 
the main stream of the work on supercomputers is 
centered in the Federal and Special Systems 
Group, Paoli, Pa. For almost two decades, the 
challenge of the parallel machine has been actively 
pursued without interruption. In that time a series 
of major systems have been developed, starting 
with ILLIAC IV, then PEPE, followed by BSP; and 
this paper describes the historical events in the 
development of these systems. A new parallel 
design currently under study for NASA called the 
Flow Model Processor (FMP) is not discussed here. 

These machines as a group represent some of 
the most ambitious undertakings in the industry 
(Table 1). With the exception of the FMP, all have 
been completed in a fully working sense, and all 
substantially met their original design objectives. 

As a group they are certainly a tribute to the 
designers whose skills harnessed enormous quan- 
tities of logic and memory circuits in concerted 
processing functions. Their contribution to com- 
puter science has been made, but perhaps not fully 
realized. The design rationale of these machines as 
a machine class (SIMD) provides the only 
demonstrable performance response for that class 
of large scientific applications that have vec- 
torizable programs. 

This 19-year history is intended as a synopsis 
of the plans, events and results of three major 
engineering experiences at the Burroughs Great 
Valley Laboratories. Unfortunately history, like 
art, is seen through the mind of the beholder and 
where serious omissions or errors occur they are 
certainly not intentional. The lessons learned and 
the experience derived from these endeavors are 
continuing to serve our engineering staff in the 
development of the FMP. 
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Table 1. Comparison of Parallel Processor Capabilities 


PEPE ILLIAC IV BSP 
Data Word Size 32 bits 64 bits 48 bits 
Instruction 32 bits 32 bits 24-48 bits 
Word Size 
Backing Store In host Paged to PE N-Mos RAM 
Memory Cycle 100 ns 250 ns 160 ns 
Number of Up to 288 64 16 


Processing Elements 


Processing Element 32-bit floating 66-bit floating 48-bit floating 


point accumu- point accumu- point memory 
lator oriented lator oriented oriented. 
Microprogrammed Yes Yes Yes 
Processing Element Linear array 4 nearest Cross Bar 
Connections neighbors 
Parallel Operation Yes Yes Yes 
Within Arithmetic Unit 
Associative Yes Pseudo No 
Addressing 
High Order Language PFOR GLYPNIR FORTRAN 
Processing Speed 1 1 
Add 300 ns} 500 m8; » 160 
Multiply 1.9 us 700 ns"? 320 


1. Time for one PE; all PEs may operate in parallel 

2. Two operations may complete in this time 

3. May be computed as N2 times 0.85 s, where each operand is assumed to 
consist of N bits. 


ILLIAC IV 


The ILLIAC IV computer was a product of the 
mid-sixties, its original goals reflecting the prevail- 
ing optimism in the country and particularly in the 
young computer industry. It was the era of the 
“main frame houses” that continued to 


Illiac IV Installed at NASA Ames Research Center, 
Mountain View, California 


The seeds of the ILLIAC IV program evolved 
from a project called Solomon developed at the 
Westinghouse Corporation in Baltimore, 
Maryland. The circumstance that marked the of- 
ficial beginning of the ILLIAC IV program was the 
move by Dr. Daniel Slotnick, a Solomon principal, 
from Westinghouse to the University of Illinois 
and the subsequent designation of that institution 
as the prime contractor by the Advanced Research 
Projects Agency of the Department of Defense. 

The program plan was to have the University 
develop the system software and subcontract the 
hardware development on the basis of a com- 
petitive proposal. Study definition contracts 
awarded to Burroughs, Control Data Corporation 
and RCA resulted in three proposals in which 
Burroughs was awarded the hardware develop- 
ment contract in 1967. 

The central objective of the system was 10 
operations per second. This, of course, placed con- 
siderable emphasis on hardware component 
speeds and parallel architectural design [1]. The 
proposed system contained 4 independent 
quadrants of 64 Processing Elements (PE) each, for 
a total of 256 PE’s. Each PE contained an 
arithmetic element and a data memory and was in- 
terconnected to other PE’s which were a distance 
of +8 and +1 in designated value. Thus ina 8 x 8 
array, a nearest neighbor connection pattern was 
realized. 

Each quadrant was driven by a Control Unit 
decoding a single instruction stream and _ broad- 


casting the microstep for array instruction execu- 


tion. The Control Unit has a program memory and 
a separate station for executing CV instructions 
concurrently with array instruction. ILLIAC IV 
was a classical SIMD design. 


The Hardware 


The key components of the system design 
were: plainer thin film memories and multichip 
ECL logic circuit packages. Later events were to 
show that both choices were not realizable in the 
final system. | 

Thin film memories had been in development 
in Burroughs and elsewhere for several years prior 
to the start of ILLIAC IV. Thin film was con- 
sidered the performance successor technology to 
magnetic cores and Burroughs was actively en- 
gaged in the process of moving this technology 
from the laboratory into production. Two factors 
conspired to preclude this expectation before pro- 
duction was realized: the tenacity of magnetic 
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ILLIAC IV Backplane 


cores and the pace of semiconductor memories. 
When this situation became apparent, thin films 
were discontinued, as a product and, in turn, for 
ILLIAC IV. 

Upon the demise of thin film memory at 
Burroughs, a contract was awarded to Fairchild 
Semiconductor for the PE memory system using a 
64-bit bipolar component. This contract was one of 
the more successful projects of ILLIAC IV, calling 
for the design and production of 70 memory units, 
each with a capacity of 4K words. Considering the 
tight schedule and the new technology, many 
things that might have gone wrong did not: the 
memories were delivered on schedule and to 
specification. 

The total capacity of 250K words, limited by 
cabinet volume, was a performance disadvantage 
for the growing application programs that were 
run on the system. 

As part of the Burroughs proposal, Texas 
Instrument Corporation, acting as a subcontractor 
to Burroughs, agreed to provide the Processing 
Elements (PE) of the system, fully assembled and 
tested. A PE was a 64-bit floating point arithmetic 
[2]. The design was based upon a multichip package 
in which four (up chips) were mounted on a common 
substrate and interconnected by wire bonding. The 
circuit packages, 24-pin ceramic, were to be con- 
nected on a multilayer printed circuit board, one 
per PE. 

The published reason for the termination of 
the multichip development by the contractor was 


low production yield. The design process contained 


the fundamental weakness of the multichip ap- 
proach by postponing testing to a complexity level 
not justified by the value added and not repairable. 

The fall-back position was the use of the more 


conventional 14-pin DIP packaged ECL on smaller, 
2-signal-layer, printed circuit boards, connected by 
a wired backplane. The logic circuits used were the 
TI2500 circuit family, implying that the fault of the 
initial design was the package scheme. 

The foregoing component problems were the 
major ones and contributed to schedule delays and 
cost increases for redesign. In time, the program 
scope had to be reduced from four to one quadrant 
(256 PE’s to 64 PE’s) where the 10° operations per 
second would not be possible. 


The Software 


The system software development was the 
responsibility of the University of Illinois, which 
undertook the development of a new Algol-like 
compiler called TRANQUIL [3]. In addition, an 
assembly language development called GLYPNIR 
[4] commenced at about the same time. 

TRANQUIL was, of course, a major undertak- 
ing dealing with a parallel structure unlike any 
previous experience in compiler design. It con- 
tained language extensions to allow the users to 
identify parallel (vector) constructs and to manage 
the conditional states of the PE array. A 
preliminary version of TRANQUIL was completed 
and compared against the available GLYPNIR for 
object code performance. 

The results were disappointing but not 
necessarily unreasonable for the early stage of the 
compiler. TRANQUIL, however, was discontinued 
and GLYPNIR became the principal language for 
programming ILLIAC IV. Later, after the system 
was installed at NASA Ames, another language 
emerged called CFDL (Computational Fluid 


Dynamic Language). CFDL was based on Fortran 


and supported the principal applications for that 
agency. 


The Completion 


The ILLIAC IV system was shipped to NASA 
Ames in April 1972 and was accepted by the 
customer that December. The selection of the 
NASA site in lieu of the original one at the Univer- 
sity of Illinois was due in part to the campus unrest 
of that era and the possible target the system 
presented. The system has been operational now 
for almost a decade and is considered an effective 
and productive resource in the mission of that 
agency. 

To the people who designed and built the 
ILLIAC IV, it was certainly a triumph of skill and 
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ILLIAC IV Control Unit 


determination. The size and complexity of the 
system (250 thousand, dual, in-line components) is a 
challenge by today’s standard. ILLIAC IV also 
made its contribution to the science: 
a) It demonstrated that a SIMD architec- 
ture could be used effectively on some 
important applications. 
It showed that a system of that size and 
complexity could be used productively 
and reliably. 
It made the user community “vector con- 
scious” and motivated the work toward 
vectorizing compilers and the inclusion 
of vector operations in later product 
designs. 
A major drawback to a wider use of ILLIAC 
IV was the evolution in user environment. Modern 
compilers and operating systems removed the user 


from the hardware details of programming. The 
programming pioneering days were coming to a 
close. 


b) 


PEPE (Parallel Element Processing Ensemble) 


The history of PEPE development discloses a 
number of different corporations that contributed 
in varying measure to the final delivered product. 
PEPE as an architectural concept began in the 
mid-sixties at Bell Laboratories, New Jersey, 
under the auspices of the Army Ballistic Missile 
Defense Agency (ABMDA). An early prototype 
was assembled there at the time AT&T decided to 
divest itself of military development contracts. As 
a result, the System Development Corporation 


took charge of PEPE and, in turn, engaged 
Honeywell in support of the hardware design. 

In March, 1973 Burroughs was awarded a con- 
tract by SDC to build a revised and enhanced ver- 
sion of PEPE for ABMDA, Huntsville, Alabama. 
The system Burroughs was contracted to build was 
specified in detail, focusing primarily on the prob- 
lem of radar data processing for missile defense 
systems. 

The execution of the contract by Burroughs is 
considered an industry paragon and Burroughs 
was singled out for an outstanding performance 
award by the U.S. Army for this achievement. The 
completed PEPE system was shipped from Bur- 
roughs Great Valley Laboratories, Paoli, Pa. to 
Huntsville in May 1970 and accepted by the 
customer by November of that year. The only 
significant change from the original contract was 
the reduction of the number of processing 
elements from 36 to 11 due to a program funding 
reduction. 


The Design 


The PEPE design is considered special pur- 
pose because it is driven by the single application 
of radar target correlation and tracking. This ap- 
plication naturally lends itself to parallel process- 
ing since the processing functions are identical for 
multiple target returns and predictions. The 
PEPE is really three distinct linear arrays, each of 
which performs the parallel functions of correla- 
tion, tracking, and radar control, respectively. A 
Processing Element is a single orthogonal slice of 
these hardware elements, including a common 
memory and incorporating each of the three 
functions. 

Another important aspect of the PEPE ap- 
plication is that there is no requirement for inter- 
PE communication. This permits the PE’s to 
associate in a loosely coupled “ensemble,” with a 
significant reliability advantage as a result. Multi- 
ple failures in PE would degrade but not fail the 
system. The system was packaged with 36 PEs in a 
cabinet and a maximum of 288 PEs was permitted. 

The logic component family used in PEPE was 
the Motorola 10K ECL Family. MECL 10K was a 
mix of MSI and SSI completely packaged in 
ceramic DIPs. The memory was a 1K bipolar RAM 
produced by Fairchild Inc. The novel design of the 
printed circuit boards featured a combination of 
printed wiring and wrapped post wiring that 
avoided the problems of multilayer boards. This 
design, called the composite board, was used suc- 
cessfully on the BSP. 
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The Epilog 


The PEPE system was interfaced with a CDC 
7600 host system in the Huntsville complex and 
used to develop application programs. Later the 
system was shipped to McDonnell-Douglas, 
Huntington Beach, California for its intensive 


benchmark testing. These activities are classified 


and the results cannot be published here. It can be 
reported, however, that the hardware performed 
exceedingly well and the system was returned to 
Huntsville. 

The PEPE contribution might have been more 
formidable if the world political climate had war- 
ranted it so it may be assumed that it fulfilled a 
vital need. From an engineering viewpoint, it was 
simply a job well done. 


PEPE Cabinet, Front View 


BSP (Burroughs Scientific Processor) 


The Burroughs Scientific Processor (BSP) was 
the result of an effort to develop a standard prod- 
uct supercomputer that would serve the scientific 
user community with massive computational re- 
quirements. This application requires machines 
with special architectures that can perform at 
levels beyond those achievable by circuit speed 
alone. 

Fortunately, the programs often exhibit an in- 
ternal structure in which the same operator can be 
applied to arrays or vectors of data. This had led to 
the development of several SIMD supercomputers 
of either an arithmetic pipelined or parallel pro- 
cessor design (e.g. ASC, STAR, and ILLIAC IV [1)). 
Both techniques had resulted in vector computers 


PEPE Backplane 


whose effective computational rates on suitable ap- 
plications were one to two orders of magnitude 
greater than that of serial processors constructed 
of equivalent speed circuitry. 

The generality of these machines was limited 
by restraints on the application programs. Due to 
pipeline start-up time, very long vectors of data 
_ were often required. A small scalar content could 
seriously degrade performance levels. Finally, 
they were difficult to program, often requiring 
assembly language coding and memory residency 
analysis in order that the speed of the machine be 
fully realized. | 

For .these and other reasons, the only 
machines that had achieved commercial success by 
the early 1970’s were the CDC 6600 and 7600 
series which achieved their performance levels 
primarily by the use of very high speed circuitry 
and multiple function arithmetic processors. 

Given the recently completed ILLIAC IV pro- 
gram and ongoing PEPE program, Burroughs had 
developed expertise in parallel processing which 
could be applied to developing a commercial super- 
computer. This, coupled with the Corporation’s 
desire to field a FORTRAN processor to comple- 
ment the product line and provide a test bed for a 
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new generation of high speed current-mode logic 
(BCML), provided the impetus for the 
development. 

Although the BSP was not commercially suc- 
cessful, prototype and production models of the 
BSP were built, made operational, and in fact, met 
most of their design goals. The state of the com- 
puting art was advanced in several areas. 


Design Goals 


The beginnings of the program can be traced 
to a feasibility study on repackaging ILLIAC IV 
which was conducted in 1972. A survey of the user 
community clearly showed that a more refined, 
easier to use machine was required. This led to the 
development of the set of design goals listed 
below. 

Standard Product. The BSP was to be a stan- 
dard product. This implied that it was to conform 
to the corporate standards for manufacturability, 
testibility, reliability, maintainability, high level 
language programmability, ease of use and cost. It 
would be developed and manufactured by a stan- 
dard M&E (Manufacturing and Engineering) plant. 
Corporate standard hardware technology was to 
be employed, providing a volume basis for material 
costs and manufacturing tooling. 

Attached Processor. The BSP was to be at- 
tached to a large scale commercial computer 
system such as the B7700. This provided the 
capability to extend the FORTRAN performance 
of these machines and provided the user with ac- 
cess to the sophisticated system software 
developed for commercial large systems. 

Technology Driver. The Corporation was cur- 
rently engaged in the development of a high speed 
current mode logic family and its associated liquid 
cooled packaging technology, intended for use in 
Burroughs commercial plants. The BSP was to be a 
driver for this program. Thus it would provide 
schedule pressure on the components plants in ad- 
vance of commercial requirements and be a test 
bed to shake down the technology. 


Programmability. The BSP was to be effi- 
ciently programmable exclusively in a high order 
language. In practice, this meant that FORTRAN 
was the obvious choice. Any extensions were to be 
application oriented and machine independent. A 
vectorizer was to be provided as a means of effi- 
ciently executing existing codes. 


Ease of Use. The machine was to be easy to 
use. This was motivated by users’ desire to 
minimize the cost of developing and maintaining 


application codes. 

Performance. The BSP was to be capable of 
sustaining 20 to 40 MOPS on typical application 
codes in weather forecasting, nuclear reactor 
design, structural analysis, and other similar 
fields. This was to be measured on such standard 
benchmarks as the Livermore Loops. 

In order to achieve these goals, several key 
technical problems had to be solved. 

Scalar Problem. Some means had to be found 
to minimize the impact of scalar processing. This 
had been a bottleneck in then-current designs. 

Pipeline Start-up and Short Vector Perform- 
ance. A method had to be found for ameliorating 
the effect of pipe-start-up time so that high 
performance could be achieved on relatively short 
vectors. 

Memory Conflicts and Residency. A memory 
structure had to be devised that would minimize 
the effect of memory conflicts which occurred 
when elements of operand vectors resided in the 
same memory bank. This structure could not re- 
quire the user programmer to exhaustively study 
the application and specify special residency 
requirements. 

Automatic Bit Vector Control. Bit vector con- 
trol for data dependent branching and sparse vec- 
tor operations had to be built into the machine and 
made easy to use. 

Generalized Parallel Processing. The parallel 
processor had to be generalized so that it could be 
effectively employed in more applications. 
Research in parallel processing had resulted in 
many parallel algorithms for speeding up opera- 
tions previously thought to be serial (e.g. linear 
recurrences [8]). | 

Balanced I/O Structure. High performance 
secondary store was required and had to be ac- 
cessible without excessive operating system 
overhead. 

Self-checking and Fault Tolerance. Extensive 
self-checking and fault tolerant mechanisms were 
to be built into the machine so that high reliability 
and trustworthiness could be achieved. This was to 
be done without seriously degrading the perform- 
ance of the system. 


Architectural Design 


The solution of these problems was _ under- 
taken during the preparation of the PDA (Product 
Development Authorization — an internal pro- 
posal). This effort was completed in June, 1974. 
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The first issue to be decided was whether a 
pipelined or parallel processing approach would be 
taken. The latter was chosen because of the ease of 
implementing many of the sophisticated 
algorithms which had been discovered and the ex- 


‘pertise which had developed during the ILLIAC 


IV program. Finally, the iterative nature of 
parallel processors made them more suitable for 


VLSI implementation in the future. 


Once this had been decided, the memory con- 
flict problem was then attacked. Although many 
skewing techniques were known for minimizing 
conflicts, none had the generality and uniformity 
that was desired. The result of this effort was a 
scheme [9] which offered conflict-free access to any 
linear vector whose skip distance was not a multi- 
ple of the prime number of memory banks. Even 
more importantly, the memory mapping was ap- 
plication independent. 

The use of microprogramming was explored 


as a method of simplifying the programming of the 


machine and as a means of directly executing many 
common FORTRAN constructs such as nested DO 
loops with embedded assignment statements. This 
resulted in the development of the template con- 
cept, which allowed the overlapping of vector 
operations within the temporal pipeline of the 
parallel processor and solved the pipeline start-up 
problem. (Parallel processors do exhibit another 
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start-up phenomenon in that full speed is not 
achieved until the vectors are at least as long as 
the width of the array.) | 

The scalar problem was attacked with an eye 
to minimizing the number of scalar operations and 
overlapping their execution with that of the 
parallel processor rather than relying solely upon 
raw circuit speed. Scalar operations were reduced 
by the application of parallel algorithms, 
automating memory indexing and parallel pro- 
cessor control operations in hardware, and off- 
loading I/O operations to a smart controller. 

The remaining problems were solved in an ex- 
hilerating rush of discovery that culminated in a 
design which is remarkably similar to the final 
design documented in C. Jensen’s paper [6]. The 
one major difference is that there were 67 slower 
dynamic memory banks which fetched vectors of 
length 64. The 16 arithmetic processors then ex- 
ecuted the operation in 4 steps. Thus, the machine 
reached full speed at vectors of length 64. This 
allowed the use of low cost main memory. 


BSP Demonstrating Class 6 Qualification 


Detailed Design 


In the detailed design phase of the program 
(June, 1974 to August, 1976) the implementation of 
the concepts developed during the proposal was 
pursued. It had not been clear that the alignment 
network and automatic indexing hardware could 
be built out of a reasonable number of IC’s or that 
there would not be a combinatorial explosion of 
microcode. These problems were overcome and the 
design had successfully incorporated the features 
of the architecture. 

The applications group had found that length 
of vectors in many codes were shorter than 64. It 
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would be desirable to improve the short vector 
performance of the machine. The advent of low 
cost high speed static NMOS memories such as the 
2147 made it possible to do this. The number of 
memory modules was reduced to 17 and the 
memory cycle time speeded up by a factor of 4. 
This allowed the parallel processor to come up to 
speed at vector lengths of 16 while providing the 
additional benefit of simplifying the design. 

This had the result of throwing the design into 
imbalance. The scalar processor had to prepare 
descriptors four times as fast as before. The scalar 
unit had to be speeded up in order to fully take ad- 
vantage of the faster parallel processor. 


The Turning: Point 
& 

A related sequence of events occurring in 
1977 had a large effect on the program. It had been 
observed that the scalar unit was, itself, functional- 
ly complete and could be offered as a lower cost At- 
tached FORTRAN Processor (AFP). This product 
appeared to be relatively free and was adopted. 
However, it resulted in two releases, two sets of 
software, the development of a DISK version of 
the I/O system, and an interface to the B 6800. This 
represented a significant additional workload on 
the project. | 

The BCML development was very late and did 
not meet the original performance goals. A pro- 
posal to implement the first machine in the proven 
hardware of the PEPE system was _ rejected 
because the objective of driving the technology 
was deemed essential. 

It was becoming clear that the performance of 
the scalar unit would not support application pro- 
srams that did not contain a sufficiently high con- 
tent of vector operations. The design of the scalar 
unit was straightforward, to minimize the overall 
development risks to the program. The perfor- 
mance on the Livermore Logics benchmarks (a 
scalar-vector mix) reinforced our strategy, but a- 
broader product approach would require a 
performance enhancement of the unit. At this 
point, with limited time and resources, it was felt 
the problem could be addressed in a subsequent 
product upgrade after the production start of the 
present design. 


Making It Work 


The machine was debugged during 1977 to 
1980. There were many problems to overcome. In- 
itially, late deliveries of circuits delayed the pro- 


gram. When sufficient quantities were available, 
the hardware was built and put into system test. 

The hardware technology was completely 
new, from the circuits to all three levels of packag- 
ing. In addition, the emerging CCD technology was 
to be employed for a second level store. Given the 
number of new items, it perhaps is not surprising 
that some design problems surfaced. 

The first design of the sockets exhibited loose 
contacts, the proms speeds drifted, and there was 
a damaging latent fault in the zinc pillow blocks. 
These blocks were screwed in to hold the PWB 
assembly together and were under high pressure. 
They exhibited a cold flow phenomenon which 
caused the screws to slowly pull out. The 


assemblies were literally pulling themselves apart. 


A third of the machine had to be reworked in the 
midst of debugging. The CCD devices exhibited a 


high soft failure rate and were difficult to 


manufacture. 

These problems were overcome and the pro- 
duction hardware was fully qualified, very reliable, 
and exceptionally stable. There were practically 
no electrical intermittents reported. The CCD 
memory was replaced by a dynamic RAM system. 
While this process of shaking down the hardware 
technology fulfilled one of the main objectives of 
the program, it delayed getting the machine into 
the marketplace at a critical time when CRAY was 
making deliveries for almost 2 years. 

The software set was new and fully featured. 
The maturization of this amount of software took a 
long time and prevented us from routinely running 
customer benchmarks. This was aggravated by the 
temporary loss of all 7700’s for customer 
shipments, which resulted in no system manager 
to debug the deliverable software (the alternate, 
but different, 6800 software was used instead). 
Nonetheless, by 1979, limited benchmarks could be 
run to measure the performance characteristics of 
the system. 


Performance Measurement and Marketing. In 
the codes that were tested, the design lived up to 
its promise as an excellent vector processor. The 
livermore loops ran at over 20 MOPS. In general, 
most comparisons showed that the machine was 
equivalent in performance to the CRAY I for many 
vectorizable codes. This was true even though the 
short vector performance of the parallel processor 
was only being partially realized and the hardware 
components were considerably slower. 


32 


Although the large main memory and fast 
secondary store was an advantage in large prob- 
lems, users preferred the CRAY due to the 
guaranteed performance levels that could be 
achieved on existing non-vectorized and scalar 
codes. 7 
Conclusion. The cancellation of the BCML and 


CCD programs, the attendant cost increases, the 


loss of an appropriate marketing window, and the 
lack of a dominant scalar speed led to the cancella- 
tion of the product. The design proved that it was 
possible to configure a parallel processor which 
was competitive in vector applications and con- 
siderably more general than those that had been 
designed in the past. This drive for generality is 
expected to continue into the next generation of 
MIMD architectures. 
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CONTROL DATA 6600 AND STAR-100 


James E. Thornton 
Network Systems Corporation 


Brooklyn Park, 


Abstract -- This paper reviews 
some of the starting point assumptions 
and considerations for two design pro- 
jects at Control Data Corporation; namely 
the CDC 6600 and CDC STAR-100. Each of 
these has had follow-on computer families, 
CYBER 70/170/700 for the former and 
CYBER 203/205 for the latter. Both pro- 
jects were very ambitious and plowed new 
ground in the use of parallelism in large 
scale computer design. : 


The CDC 6600 


The design of the CDC 6600 began in 
1960 [1]. The first transistor computers 
had just been delivered to the field that 
year. Ideas having to do with parallel 
processing were presented at a short- 
course conference at UCLA in which STRETCH, 
LARC, ATLAS, ILLIAC-II and GAMMA 60 were 
examined. These machines were in develop- 
ment in the United States, England and 
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France. Each attempted to exploit ways 
to reduce the idle activity within parts 
of the computer waiting for other parts 
to complete a sequential action. Since 
access to memory was often the biggest 
delay of this kind, many of the ideas had 
to do with relieving this burden. An 
example was the "instruction lookahead" 
of STRETCH. 


In 1960 it had already become apparent 
(as it has grown more important over the 
years) that brute force circuit perfor- 
mance or parallel operation were the two 
main approaches to any advanced computer 
[2]. The 6600 project attempted a fast 
version of the building block circuit in 
use at the time only to fail early on in 
the project. This resulted in a restart 
with a more complex packaging and cooling 
scheme providing a Significantly higher 
density of parts. Discrete transistors 
were used since integrated circuit 
families had not become available. 
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The first area of parallel operation 
began as a separation of input/output op- 
erations from the CPU, (Figure 1). It 
was felt that independent small processors 
with direct access to central memory 
could be dynamically assigned to control 
peripheral devices and transfer data 
between each device and central memory. 
These peripheral processing units (PPUs) 
would contend with each other for the 
channels to the devices and for the access 
to central memory. In the latter, the 
PPUs would also contend with the CPU for 
access to central memory. With the excep- 
tion of this latter factor, the PPUs 
could be designed separately from the CPU 
and represented a very convenient separa- 
tion of design duties for the design team. 
The resulting design was an innovative 
"barrel" of registers together with a 
Single arithmetic unit implementing ten 
small processors each having its own 
memory. Later implementations in follow- 
on CYBER products included physically 
independent PPUs as integrated circuit 
versions were introduced. 


In the CPU additional areas of par- 
allel operation were applied to instruc~ 
tion lookahead, multiple working registers 
and functional units. See Figure 2. 
Taking the registers first, the idea of 
inserting registers between the execution 
logic and central memory provided a means 
to optimize and overlap vead and write 
references to central memory. At the 
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beginning of the execution of a CPU pro- 
gram (or at a restart following interrup- 
tion) a single simultaneous exchange is~ 
made of the contents of the 24 registers 
and a prepared location in central memory. 
This action provided very rapid exchange 
of jobs (or operating system routines) 
enhancing the ability to support multi- 
programming. 


During execution of the CPU program 
instructions are fetched from central 
memory into an instruction stack capable 


of containing 8 60-bit words, see Figure 3. 
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Each new instruction word is immediately 
fed to the functional unitsfrom execution 
but is also retained and "pushed up" for 
possible re-use in certain loop routines 
or the like without the further require- 
ment of fetching from central memory. In 
certain follow-on CYBER products of lower 
performance the instruction stack was not 
utilized nor were separate functional 
units. The instruction stack was a 
principal important ingredient in estab- 
lishing a high degree of concurrent 
operation in the functional units. 


A further important ingredient in 
supporting concurrent operation in the 
functional units was a control unit called 
the SCOREBOARD [3]. In essence, this unit 
kept track of reservations of the working 
registers allowing functional units to 
reserve each register it needed for either 
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reading out or writing into the register. 
Instructions could be "issued" to a func- 
tional unit in order, could be executed 
out of order, but would return results to 
registers in order. As a result, no 
functional unit would block the issuance 
of instructions unless a unit was busy 

or a register reservation could not be 
made. 


Optimization of short program loops 
in the instruction stack could produce 
dramatic overlap of functions. Alsosimply 
the implicit use of this control technique 
and the existence of ten functional units 
provided a degree of natural concurrency. 


From a design point of view each func- 
tional unit was specially designed to 
execute its narrow group of instructions 
interfacing only to the registers and 
minimal control signals. As a result 
several of the units achieved very high 
performance avoiding "impediments" of 
sharing logic with other functions. 
CYBER products incorporated further 
"pipeline" arithmetic design to such 
functional units. 


Later 


Two principal criticisms were leveled 
at the CDC 6600. The first was that it 
was not a time sharing machine. This 
criticism has been hotly contested and 
arose from the lack of interrupt on the 
PPUs and the lack of virtual memory for 
memory management. A second criticism was 
the lack of variable length string arith- 
metic and instructions for character 
handling anddecimal numbers. Software 
routines to accomplish these requirements 
proved slow in relation to the IBM 360 for 
example. For scientific and binary 
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machine. 
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oriented computing though, this machine 
was superior for its day and for many 
years even to today. 


The CDC STAR-100 


Control Data began this project in 
response to a request for proposal (RFP) 
from Lawrence Livermore Laboratory (LLL). 


Preceding this were requests for informa- 


tion (RFI) and other interaction with the 
LLL people as to the possibility of CDC 
being the manufacturer of an ILLIAC IV 
type of machine. Management response to 
that suggestion was negative. Also our 
technical response was that we had a 
different idéa. The essence of this 
different idea was to build on our growing 
knowledge of "pipeline" architecture jin 
response to the requirement. 


The CDC 7600 (follow-on to the CDC 


6600) utilized an improved functional 


arithmetic unit design which allowed each 
unit to be entered with new input operands 
well before previous operands were pro- 
cessed and results obtained. This added 
pipelining of the execution units along 
with the instruction stack and control 
brought additional concurrency to the 

It was felt that this could be 
enhanced further by explicit pipeline 
instructions for the CDC STAR-100. 


During 1965 and 1966 Control Data 
faced significant competitive pressure 
from IBM in particular and was attempting 
to expand the role of the CDC 6600 into 
commercial (non-scientific) markets. 


Principal competitive factors were the 


lack of variable length byte oriented 


instructions, decimal arithmetic and 
virtual memory. Interal strategies in CDC 
were pressing for new machines stressing 
these properties. Thus the STAR-100 pro- 
ject moved to respond. The variable 
length and 8-bit byte STRING oriented 
instructions were a somewhat natural fit 
with the idea of ARRAY instructions 
utilizing highly parallel pipeline execu- 
tion. The name STAR-100 was formed from 
STRING and ARRAY with the objective of 

100 million operations per second. STRING 
operations were assumed to be executed in 
a separate functional unit and thus were 
not considered an impediment to the high 
performance end of the machine. 


Moving to the eight-bit world from 
the octal and six-bit environment was a 
major learning experience and further 
complicated by a shift from one's comple- 
ment to two's complement representation 
of binary numbers. But the fundamental 
new area of design was the processing of 
vectors and arrays through pipelin 
arithmetic. | 


The CDC STAR-100 computer was 
structured around a 4-million to 8-million 
byte high-bandwidth magnetic core memory. 
Instructions specify operations’ on vari- 
able length streams of data allowing full 
use of the memory bandwidth and the 
arithmetic pipelines [4]. In streaming 
mode the system has the capability of 
producing 100 million 32-bit floating 
point results per second. See Figure 4. 
Memory has 32 interleaved banks, each 
bank containing 2048 512-bit words (for 
the 4-million byte capacity). The memory 
was an outgrowth of previous extended core 
storage (ECS) systems built for the CDC 
6600. The long word length and relatively 
slow cycle of 1.28 microseconds with 
interleaved banks was suitable for 
streaming use. Working registers in 
local storage include 256 sixty-four bit 
general registers. Operands could enter 
the multiple pipeline arithmetic either 
from the general registers or from | 
central memory; similarly, results could 
return to the general registers or 
central memory. 
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Vector instructions perform operations 
on ordered scalars. Such instructions are 
64 bits in length and contain the instruc- 
tion code and three pairs of eight-bit 
designators. 
means to support a three address environ- 
ment. In general two input streams and 
one result stream are defined. 
input streams each pair of designators 
defines working registers (of the 256 
general registers) which contain the base 
address, vector length and an offset to 
the base address, Length and offset are 
also held in defined working registers 
together with the base address of a 
control vector. The control vector is a 
bit string in which each unique bit is 
associated with the storing of each result 
element in the result stream. A bit in 


These designators provide the 


for the two 


the control vector can prohibit the storing 


of a result element, thus providing for 
certain masking and boundary controls. 


Thus a single explicit instruction can 
direct the execution of many floating 
point operations in a highly organized 
fashion. In practice, the preparation and 
synchronization of the three streams was 
very complicated and required a longer 
"start up" period than had been expected. 
As a result, the CDC STAR-100 was more 
efficient the longer the vectors were. 
Also with a slow central memory, scalar 
operations were not competitive although 
offset somewhat by the high speed general 
registers. 


Both projects suffered delays with the 
CDC STAR-100 being the longer and 
several years in duration. Problems with 


37 


6600 occurred very early and cost only 
about a year. Problems with the STAR-100 
occurred very late in the planned schedule 
and resulted in an extended delay. Both 
projects were exceedingly aggressive 

and far reaching. The properties of the 
CDC 6600 have enabled a long lasting 
product line. The properties of the 

CDC STAR-100 are only now being exploited 
in the CYBER-200 machines. 
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Abstract -- This paper addresses the problem 
of programming distributed systems within the 
framework of the Ada language, which provides 
primitives for interprocess communication based 
upon the model of Communicating Sequential Proc- 
esses. We first discuss our basic assumptions 
concerning the underlying target configuration, 
the physical communication medium which is to sup- 
port that application and pattern of the logical 
communication within the application proper. We 
then develop a first approach for constructing 
such applications using the separate compilation 
facilities of Ada. Finally, we consider two pos- 
sible protocols for implementing the requisite 
distributed interprocess communication, referred 
to as the Remote Entry Call and the Remote Proce- 
dure Call, respectively. 

1. Introduction 

This paper addresses the problem of program- 
ming distributed applications within the framework 
of the Ada language [3,2,5]. Our ambitions here 
are confined to outlining a first approach in this 
area, whence a number of significant issues asso- 
ciated with the construction of such software are, 
of necessity, deferred. We begin in Section 2 by 
setting forth the basic assumptions which underly 
the overall approach described herein. Section 3 
is concerned with establishing an appropriate com- 
pile-time framework, within which the programming 
of an application destined for a multi-processor 
target configuration can be carried out in much 
the same way as one intended for a uni~processor 
target. In the final section, we turn to the 
development of protocols to support the requisite 
"interprocessor procedure call" capability, so 


that the applications of interest can then be 
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programmed without further regard to the distribu-~- 
ted nature of the underlying target configuration. 
Two successive versions of such a protocol are 
defined. These are referred to as the Remote Entry 
Call and Remote Procedure Call, respectively. 
2. Basic Assumptions 

This section outlines our basic assumptions 
concerning the nature of the distributed applica- 
tion systems to be programmed in Ada. Abstractly, 
we wish to conceive of some given target configur- 
ation, onto which a certain application is ulti- 
mately to be mapped, as a network of communicating 
"Ada Virtual Machines" (AVMs). Every such config- 
uration may therefore be characterized in first 


instance by an undirected graph, as depicted for 
example in Fig. 2-1: 


FIGURE 2-1: 


A network of communicating Ada Virtual Machines. 


The individual nodes of a particular network. 
correspond to fully independent (autonomous) proc- 
essors, each of which is capable of executing a 
complete Ada program. Accordingly, ag Ada Virtual 
Machine is to be viewed as an idealized single- 
processor environment that directly implements the 
run-time facilities required to support the seman- 
tics of the full Ada language. Thus the concept 
of an AVM embodies an abstract object machine for 
which Ada source programs might conventionally be 
compiled (but disregarding all dependencies upon 
a specific hardware architecture and/or host oper- 
ating system); concretely, it may be thought of as 
providing its own address space, scheduler and , 
real-time clock, together with a certain set of 


external interrupts, low-level device interfaces, 
etc. We refer to this environment as a "virtual" 
(rather than "actual") machine so as to also eli- 
minate considerations arising from the fact that 
several such machines might sometimes be multipro- 
grammed on the same physical processor (e.g., in 
the context of an underlying time-sharing system). 


The connecting edges appearing in a given 
network represent possible paths of bidtrecttonal 
communication between distinct processor nodes. 
(Non~connecting edges, like those shown in Fig. 
2-1, are meant to suggest additional paths of com- 
munication, for instance with various devices 
attached to the individual virtual machines; how- 
ever, interactions with purely local resources of 
this sort are of no direct interest here, and so 
will not be further discussed.) The connectivity 
of such a network is assumed to be sufficient for 
supporting the intended pattern of interprocessor 
communication, meaning that each edge corresponds 
to a path whereby both the requisite data and any 
appropriate control signals can be physically 
transmitted between the two connected nodes; more- 
over, the bandwidth of these connections is pre- 
sumed to be adequate for the application at hand. 


We shall assume that the target configuration 
for any specific application is always stattcally 
defined--i.e., that the number of virtual (and 
even actual) processors is established once and 
for all, and that the necessary paths of communi- 
cation exist from the outset. The primary stipu- 
lation which we impose is that all interactions 
between separate nodes of the network thereby 
defined must be achieved by explicit communication 
across these more or less "thin wire" connections. 
In other words, we preclude interactions based 
upon the existence of shared memory or any form 
of centralized control. This implies that the 
application in question must be formulated from 
the beginning as a distributed system. The issue 
we wish to address is how one might go about pro- 
gramming such applications in Ada, so as to be 
able to effectively map those programs onto the 
given multiprocessor configuration. 


Ada provides an adequate basis for program- 
ming systems of communicating sequential processes 
[1], and for supporting synchronous communication 
between these processes. Once some desired pat- 
tern of logical communication has been established 
(for example, that depicted in Fig. 2-2), there 
is no particular difficulty involved in formula- 
ting the specifications and subsequent definitions 
for the corresponding caller and server processes 
(or subsystems). Insofar as the resultant pro- 
gram is destined to be executed ona single proc- 
essor configuration (as represented by the Ada 
Virtual Machine considered here), the job is 
effectively done once all of the separate compi- 
lation units comprised by that program have been 
successfully compiled (since an AVM is assumed to 
be capable of directly executing any complete Ada 
program, regardless of its textual decomposition). 


However, when the target configuration is a 
network of interconnected AVMs (e.g., Fig. 2-3), 
then it is far less obvious how to proceed. The 


ag 


FIGURE 2-2: 


Example Application, in terms of Communicating Sequential Processes 


AVM 2 AVM. 3 


FIGURE 2-3: Example Target Configuration, in terms of interconnected Ada Virtual Machines 


effect that we should like to achieve is to be able 
to essentially "superimpose" the intended pattern 
of communication upon the underlying network (as 
suggested by Fig. 2-4), thereby preserving the 
overall logical structure of the application. 

While the ability to do so presupposes. that the 
application in question was formulated as a distri- 
buted system in the first place (i.e., based solely 
upon communicating sequential processes), it 
should then be possible to map that structure onto 
any appropriate target configuration, whether cen- 
tralized or distributed. This is the premise of 
the approach outlined in the present paper. 


FIGURE 2-4: 


Superposition of Example Application upon the given Target Configuration 


3. Overall Framework 


In this section, we shall outline a basic ap- 
proach to constructing a distributed application, 


such as that depicted in Fig. 2-4,. by making exten- 


sive use of the separate compilation facilities in 
Ada (and also of the related capabilities for 


generic program units). The framework to be devel- 


oped here must be regarded as simply a first 
approach to the problem whence many practical 
aspects associated with building distributed soft- 
ware will have to be glossed over (or neglected 
entirely) in the present context. (In particular, 
we shall be concerned solely with constructing a 
definition for the steady-state operation of a 
given application, even though it is well known 
that the issues involved in startup and shutdown 
of a distributed system are far more difficult to 
address.) This approach nonetheless provides a 
number of important insights into the nature of 
the problem itself. 


The package declaration that follows shows, 
in skeleton form, an initial specification for the 
application as a whole: 


package Config is 


type NODE is (NNS$1, NNS$2, ..., NNS$n); 
~- Node Names 
type NSET is array (NODE) of BOOLEAN; 
| -~- Set of Nodes 


package Node$l1 is ... end; 


package NodeSh is 


‘type OPER is (OP$1, OPS$2, ..., OPSk); 
~- Op Codes for Remote Services 
-- other type definitions ... 


Host: constant NODE 
Conn: constant NSET 


NN$h; 

(...=> True, 
others => False); 
-- other constant declarations ... 


generic 
Site: in NODE; 
package Service is 
procedure PS1 (...); 
procedure PSk (...); 
end Service; 


end NodeSh; 


package Node$n is ... end; 
end Config; 
In order to formulate such definitions, we have 


adopted the (purely lexical) convention of writing 
names with an embedded dollar sign, so as to be 


able to refer to unique identifiers as if they were 


elements of a set distinguished by means of sub- 


scripts. For instance, the declaration of the enu- 


meration type NODE is meant to suggest a range of 


values NN], NNo, -.-.-, NN,, whereas in practice the 
individual values would correspond to application- 
specific mnemonic names (e.g., NN, might be writ- 
ten as the Ada identifier "FileServer"). Also, 
P$1, ..., P$k denote the particular procedural 
services which that individual node provides. 


This first specification consists primarily of 
package specifications for the constituent nodes 
of the overall configuration. The logical inter- 
face of each separate node comprises, in addition 
to various type and constant declarations, the 
declaration for a generic package Service, which 
will ultimately be instantiated within the defi- 
nition of other (caller) nodes. 


The associated body for the package Config, 
shown below, serves to establish the overall con- 
ventions which are common to all nodes. As such, 
it is primarily concerned with defining the under- 
lying communications interface, by which informa- 
tion will be physically interchanged between dis- 
tinct (virtual) machines within the configuration. 
These conventions are embodied firstly in a series 
of data type definitions, including: 


- XREC, corresponding to a "transaction record" 
that contains at least an indication of the 
respective source and destination nodes for 
each transmission, as well as an encodement 
of the particular "operation code" for that 
particular transmission; 


- XMIT, corresponding to a complete transmis- 
Sion, as delivered to or received from a 
local communications interface, which 
includes both an XREC component and an asso- 
ciated buffer (whereby argument or result 
data may be forwarded). 


Two different types of transmission are dis- 
tinguished at the communications level, namely 
Transmit Call (XC) and Transmit Response (XR), 
and the corresponding subtypes of XMIT are also 
defined (CALL and RESP, respectively). 


Finally, the actual communications interface 
is specified in the form of two distinct generic 
packages, ChnDriver and ChnServer. Each of these 
have a number of generic parameters, in particular, 
an operation Request and an operation Deliver 
which will be bound in the context of their sub- 
sequent instantiations in order to carry out the 
necessary acquisition and disposition of trans- 
missions over the underlying medium. This inter- 
face is assumed to take full responsibility for 
setting and using the Orig and Dest Fields of the 
transaction record part of such transmissions. 
The details of these interfaces will not be fur- 
ther specified here. 


with Medium; 
package body Config is 


function Card(N:in NSET) return INTEGER range 
O..NODE'POS (NODE'LAST)+1...; 


subtype OPID is INTEGER range 0.....; 
~~ Max Op Code © 


type XREC is record 
Orig, Dest: NODE; 
Code: OPID; 


end record; 


‘type BUFF is ... ; 
type XTYP is (XC, XR); 


type XMIT(T: XTYP) is record 
X: XREC; 
B: BUFF; 

end record; 


subtype CALL is XMIT(XC) ; 
subtype RESP is XMIT(XR); 


generic 
From, To: in NODE; 
with procedure Request(C: in out CALL) ; 
with procedure Deliver(R: in RESP); 
package ChnDriver; 


generic 
From: in NSET; 
To : in NODE; 


with procedure Request(R: in out RESP); 
with procedure Deliver(C: in CALL); 
package ChnServer; 


package body ChnDriver is ... use Medium; 


ee. end; 


package body ChnServer is ... use Medium; 


eee End; 


package body Node$1 is separate; 


package body NodeSn is separate; 
end Config; 


We now introduce analogous definitions for 
each separate node of our distributed configura- 


tion (the outline for that representing the NodeSh 


is shown below). In this instance, however, such 
a step no longer constitutes an "extra" level of 
abstraction; rather, it is essential -- for this 
is the first place in which we permit actual in- 


stantiations (of code or data), since we have only 


now reached a level that corresponds to some phy- 
Sical machine environment. 


The definition of such a shell serves to 
establish what might be construed as an "Applica- 
tion Virtual Machine," in terms of which the con- 
stituent subsystems of the actual application 
(e.g., the modules AS$1...ASm) may then be pro- 
grammed without further regard to the distributed 
nature of the underlying target configuration. 
This definition serves to provide: 
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- An indication of the target environment for 
this particular node (pragma SYSTEM) ; 


- The specification of the application modules 
to be hosted within this node (the package 
declarations for AS1...AS$m); 


- A mapping of the remotely callable services 
provided by this node onto the operations 
defined by those modules (e.g., renaming of 
PS$i); 


- Definition of both sides of the higher-level 
protocol required to support such remote 
calls, namely the driver side (the body of 
the generic package Service) and the server 
side (the body of the non-generic package 
Support) ; 


- Finally, instantiations of the remote serv- 
ices needed to tmplement the application 
modules of this node (package NodeS$u, 
NodeSv, etc.). 


separate (Config) 
package body Node$h is 


pragma SYSTEM(...); 
~- Specify local application modules: 


package AS$1 is 
procedure QS$1(...); 
procedure QSf(...); 

end AS1 

package A$m is 
procedure QS1(...); 
procedure Q$g(...); 

end AS$m; 


-~- Local (re)definition of services: 


procedure PSi(...) renames AS$a.QSb; 


-- Support services called remotely: 


package Support; 
package body Support is -- Server side of Protocol 


end Support; 


package body Service is -- Driver side of Protocol 


end Service; 
-- Provide services needed locally: 


package Node$u is new Config.Node$u.Service 
(Site => Host); 

package Node$v is new Config.Node$v.Service 
| (Site => Host); 


package body AS1 is separate; 


package body ASm is separate; 
end NodeSsh; 


Within the framework of this shell, the appli- 
cation modules would again be defined as separately 
compiled. subunits: 


separate (Config.Node$h) 
package body AS1 is 


~-- NodeSu.PSi(...) ... 


end ASI; 


separate (Config.NodeSsh) 
package body A$m is 


vee: NOGESVsPS7] (ese) See 
end ASm; 


The approach outlined above effectively makes 
use of the Ada "Program Library" to establish the 
context in which individual components of a distri- 
buted application may be defined in terms of a 
purely procedural interface to services which are 
nonetheless hosted on different nodes of a distri- 
buted target configuration. The possible proto- 
cols by which such an “interprocessor procedure 
call" capability might be realized are the subject 
of Section 4 of this paper. 


It must be pointed out, however, that the 
usage of the Ada separate compilation facilities 
described above, while legitimate in every respect, 
May nonetheless cause a potential problem in the 
context of overly "naive" implementations of those 
facilities. Specifically, the issue arises in 
conjunction with circular dependencies (wherein 
Node; calls Nodes, and so must instantiate its 
Service package which is defined in the body of 
Node», and vice versa). Whereas this,too, could 
be "programmed around" (at the cost of considerable 
effort and obscurity), in this instance it would 
seem preferable to wait for more mature implemen- 
tations. 


4. Possible Protocols 


In this section, we shall be concerned with 
possible protocols by which the desired interproc- 
essor procedure call capability might be implemen- 
ted for a particular distributed application. 

Thus, at this point, we shall elaborate upon actual 
definitions for the driver side (which serves to 
map such calls onto the communications interface) 
and the server side (which acts to carry out such 
calls on behalf of any remote caller); these imple- 
mentations correspond to the bodies of the packages 
Service and Support, respectively, which are 
defined within the body for the node wherein those 
remotely callable services are to be hosted. 
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For purposes of exposition, we shall consider 
only one instance of such a definition, that asso- 
ciated with the virtual machine NodeSh (which | 
makes available the operations PS$1...PS$k) and, 
moreover, we shall sketch out the detailed imple- 
mentation for only one of the operations in ques- 
tion, identified throughout as P$i. This involves 
no loss of generality, since the structure for all 
other operations and nodes is essentially the 
same. Accordingly, the overall goal for the imple- 
mentations that will be described here is to pro- 
vide the capability suggested by Fig. 4-1, namely 
to permit application processes such as Aj, Agr 
B...C, residing on separate (virtual) machines, to 
invoke the operation P; hosted by Node; (corre- 
sponding to yet another such virtual machine) as 
though by a simple (local) procedure call. 


FIGURE 4-1: Overview of the Required Capability, to Support Remote Calls on the operation Py 


To simplify the presentation, we shall assume 
that the operation of interest has the following 
specification: 


procedure PSi (Al:in TAl1;...;Ax:in TAx; 
Rl:out TR1;...;Ry:out TRy); where Aj stands out 
for the jth input argument (of type TAj) and Rk 
stands for the kth output result (of type TRk); 
formal parameters of mode "in out" are thus pre- 
sumed to have been decomposed into separate input 
and output objects. We note that some restric- 
tions must be imposed upon the types of parameters 
in the present context. Specifically, it must be 
possible to copy the associated objects from one 
machine to another, which apparently precludes 
the passage of task or “limited private" types 
(for which assignment is. not defined). Similarly, 
it must be possible to meaningfully refer to such 
objects both locally and remotély, which precludes 
the. passage of access types (except when declared 
as “private"). . 


In the subsections which follow, we shall 
develop two alternative definitions for the 
desired protocol, referred toas the Remote Entry 
Call and the Remote Procedure Call, respectively. 


In the first (and simpler) version, we impose 
the property that, from each distinct caller node, 
there is at most one remote call to any given 
operation in progress at a time. Such an imple- 
mentation would be appropriate, for example, in 
cases where the operations to be invoked are known 
to be entries (i.e., serviced in a purely 


sequential fashion), whence there is no advantage 
to be gained by forwarding more than one poten- 
tially concurrent call from some particular node 
(since these would then have either to be buffered 
within the communications medium or enqueued by 
the corresponding server node). 


The second version relaxes this restriction, 
allowing a (bounded) number of calls on the same 
operation to proceed concurrently from within each 
separate caller node. This somewhat more compli- 
cated strategy might be adopted in situations 
where there is some optimization to be achieved 
(on the server side) by recognizing new calls 
before all previous ones have been completely serv- 
iced (as for instance in the context of a disk 
scheduler). 


It must be stressed that there is no semantic 
distinction between these alternative implementa- 
tion strategies. The choice affects only system 
throughput and thus the overall performance of the 
application in question; it should therefore be 
made on that basis alone. 


We shall now proceed to develop Ada defini- 
tions for these two alternative protocols, 
expressed primarily in terms of the synchronous 
communication primitives embodied in the tasking 
facilities of that language. Each of the imple- 
mentations to be described consists of the driver 
side (the body of the generic package Service, 
which is to be instantiated within one or more 
remote caller nodes), and the corresponding server 
side (the body of the package Support, which 
resides within the Ada Virtual Machine that hosts 
the operations in question). 


4.1 The Remote Entry Call 


As stated above, the first strategy is based 
on the property that no more than one remote call 
on each operation is in progress from the same node 
at any given time, so as to avoid saturation of 
the communications medium or overloading of the 
corresponding server node. As such, this property 
is necessarily established on the driver side of 
the protocol defined below. 


4.1.1. Driver Side. The overall structure 
and associated data-flow for the driver side are 
depicted in Fig. 4-2. Calls on the operation PSi, 
originating from application tasks Ta...Tz are 
fielded by an Agent which is specific to that oper- 
ation (AGTi); this latter acts to acquire the input 
arguments for each individual call (Al...Ax) and to 
subsequently deliver the corresponding output 
results (R1...Ry). These two separate transactions 
for every operation hosted by Node; (P$1...P$k) are 
dispatched via distinct processes, the Driver Call 
Handler (DCH) and the Driver Response Handler (DRH), 
which respectively act to forward calls and 
retrieve responses from the Local Channel Driver 
(LCD) for Node,. These handlers are formulated as 
independent (concurrent) processes so that the 
order in which LCD requests calls or delivers 
responses will not be unnecessarily constrained by 
this protocol. | 
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DCH 


(Call Handler) 


FIGURE 4-2: Overall Structure and Data-Flow on the Driver Side for the Remote Rntry Call Protocol. 


The outline of (generic) package body for the 
driver side is shown below: 


package body Service is 
-- Driver Side, defined in Config.NodeSh: 


task DCH is 
entry ReqCall(C: in out CALL); 
entry DCS$1(...); 
entry DCSi(Al: in TAl; ...; Ax: in TAx); 
entry DCSk(...); 
end; 


task DRH is 
entry DelResp(R: in RESP) ; 
entry RRS$1(...); 
entry RR$i($1l: out Trl; ...; Ry: out TRy); 
entry RRSk(...); 
end; 


package LCD is new ChnDriver ( 
From => Site, To => Host, 
Request => DCH.ReqCall, 
Deliver => DRH.DelResp) ; 


package DS1 is ... end; 
package DSi is 
procedure P(Al: in TA]1;...;Ax: in TAx; 
Rl: out TR1;...;Ry: out TRy); 
procedure PutArg(B: in out BUFF; 
Al: in TAl: ...; Ax: in TAx); 
procedure GetRes(B: in BUFF; Rl: out TRI; 
-.-; Ry: out TRy); 
end DS$i; 


package DSk is .. end; 
procedure PS1 (...) renames DS1.P; 
procaduee PSk (...) renames DSk.P; 
... + bodies of DCH, DRH, DS1l, ..., DSk 
end Service; 
The handler processes DCH and DRH are directly 


specified in terms of Ada tasks, with entries to 
be called by the channel driver and by the agents 


for the remote operations to be invoked. LCD is 
obtained by instantiation of the generic defini- 
tion associated with the overall configuration. 
For each operation, there is then a corresponding 
Driver package, D$1...D$k, which provides an oper- 
ation P to be called by an application process © 
(as P$i) along with operations for moving argu- 
ments into and results out of the actual transmis- 
sion buffers. 


The definition of the Driver Call Handler is 
as follows: 


task body. DCH is 
begin 
loop 
accept ReqCall(C: in out CALL) do 
select 
accept DC$1(...) do ... end; 
or 
accept DCSi(Al:in TAl1;...; Ax:in TAx) do 
C.X.Code := OPER'POS(OPSi) ; 
D$i.PutArg(C.B, Al,..., Ax); 
end DCSi; 
or | 
accept DCSk(...) do ... end; 
end select; 
end RegqCall; 
end loop; > 
end DCH; 


Each time the channel driver requests a call 
(entry ReqCall), DCH makes a (non-deterministic) 
choice among the Agents waiting to deliver a call 
for one particular operation (entry DCS$i), where- 
upon it sets the OpCode of the transaction record 
for that CALL and transfers the arguments into the 
associated data buffer. | 


The definition of the Server Response Handler 
shows the other side of this interface with the 
Local Channel Driver for Node, : 


task body DRH is 
begin 
loop 
~ accept DelResp(R: in RESP) do 
case OPER'VAL(R.X.Code) is 
when OPS] => ...; 
when OPSi => 
accept RRS$i(R1: out TRI1,..., 
Ry: out TRy) do 
DS$i.GetRes(R.B, R1l,..., Ry); 
end RRSi; 


when OPSk = ...; 
end case; 
end DelResp; 
end loop; 
end DRH; 


Each time LCD delivers a response (entry DelResp), 
DRH decodes the Opcode appearing inthe transaction 
record of that RESP and then accepts the pending 
response request from the agent for that operation 
(entry RRSi), transferring the corresponding result 
data. | 


The outline of the body for a Driver package 
is shown below: 


package body DS$i is 


task AGT is 7 
entry Exec(Al: in TAl;...;Ax: in TAx; 
Rl: out TR1;...; Ry: out TRy); 
end; 


procedure P(Al: in TAl;...;Ax: in TAx; 
Rl: out TR1;...; Ry: out TRy) 
renames AGT.Exec; 


procedure PutArg(...) is ... end; 
procedure GetRes(...) is ... end; 


»»- + body of AGT 
end DSi; 


The (sole) Agent for the operation PS$i is simply 
defined as a task having an entry Exec (with the 
same signature), and the operation is renamed to 
be a call to this entry (which is sufficient to 
ensure the desired property--that calls from the 
application tasks of each node will be serviced 
sequentially). In addition, the low-level opera- 
tions PutArg and GetRes are defined herein (pre- 
sumably in terms of representation specifications 
and/or untyped conversions). 


Finally the body of the agent task for PS$i is 
defined as follows: 


task body AGT is 
begin 
loop 
accept Exec(Al:in TAl1;...;Ax:in TAx; 
Rl:out TR1;...;Ry:out TRy) do 
DCH.DCSi(Al,..., Ax); 
DRH.RRS$i(R1,..., Ry); 
end Exec; 
end loop; 
end AGT; 


For each successive external call to the entry 
Exec (while the calling process is held in rendez- 
vous), the Agent first delivers the call to DCH 
and then requests the response from DRH. Because 
these transactions take place within the rendez- 
vous itself, arguments and results need only be 
copied once (via the operations PutArg and GetRes) 
upon actual transmission. 


4.1.2. The Server Side. The server side of 
the Remote Entry Call protocol is essentially sym- 
metric to the driver side. The overall structure 
and associated data-flow for this side are shown 
in Fig. 4-3. The Local Channel Server (LCS) for- 
wards incoming calls. from. connected nodes to the 
Server Call Handler (SCH), and transmits the cor- 
responding responses as dispatched by the Server 
Response Handler (SRH). As before, these handlers 
are formulated as independent processes (so as not 
to constrain the order of transactions with the 
underlying communications medium) and play a purely 
intermediary role. The actual calls to a locally 
supported operation P$i are performed by one of a 


FIGURE 4-3: 


Overall Structure and Data-Flow on the Server Side for the Remote Entry 
Call Protocol. 


number of Surrogate processes (SGTi), which act as 
stand-ins for the original calling processes within 
some other node. Thus, there exist multiple surro- 
gates for each remotely callable operation, which 
serve both to "buffer" incoming calls and outgoing 
responses (along with their associated transaction 
records) as well as to invoke the actual operation 
in question (as provided by one of the application 
modules Al...Am supported by Node}). 


The implementation of the server side for 
Node, is defined in the (non-generic) package body 
Support, shown in outline form below: 


package body Support is . 
-~- Server Side, defined in Config.Node$h; 


task SCH is 
entry DelCall(c: in CALL); 
entry RCS$1(...); 


eee 


entry RCSi(XR: out XREC; Al: out TAL sided 
Ax: out TAx); 
entry RCSk(...); 
end; 


task SRH is 
entry ReqResp(R: in out RESP); 
entry DRS1(...); 


entry DRSi(XR: in XREC; Rl: in TR1;...; 
Ry: in TRy); 
entry DRSk(...); 
end; 


package LCS is new ChnServer ( 
From => Conn, To => Host, 
Deliver => SCH.DelCall, 
Request => SRH.ReqResp) ; 


package S$l is ... end; 
package S$i is | 
procedure GetArg(B: in BUFF; Al: out TAI1;..., 
, : Ax: out TAx); 
procedure PutRes(B: in out BUFF; _ 
' Rl: in TR1;...; Ry: in TRy); 
end SS$i; 


package S$k is ... end; 


+ bodies of SCH, SRH, S$l, ..-, S$k 


end Support; 


The handler processes are again directly specified 
as Ada tasks (SCH and SRH) and the communications 
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interface is obtained by generic instantiation of 
the definition ChnServer for the overall configur- 
ation. As on the driver side, separate Server 
packages S$1...S$k are introduced here for each 
individual operation P$1...P$k that can be called 
remotely. 


The definition of the Server Call Handler is 
as follows: 


task body SCH is 
begin 
loop 
accept DelCall(C: in CALL) do 
case OPER'VAL(C.X.Code) is 
when OPS1 
accept RCS$i(XR:out XREC; Al:out TAI;...; 
Ax:out TAx) do 


e 
eeegze 


XR := C.X; 
S$i.PutArg(C.B, Al, 
end RCSi; 
when OPSk = 
end case; 
end DelCall; 
end loop; 
end SCH; 


shee ae 


e 
eoee?s 


Upon delivery of a new call from LCS (entry Del- 
Call), SCH switches on the OpCode and accepts a 
request for a call to the spécified operation 
(entry RC$i) from the naxt of the (possibly many) 
Surrogates which are queued up on the corresponding 
entry. This dispatching consists simply of copy- 
ing the transaction record contained within this 
particular CALL and transferring the associated 


arguments (via the operation PutArg provided by 
S$i). 


The definition of the Server Response Handler 
is like that of the Call Handler on the driver 


side: 


task body SRH is 


begin 
loop 
accept ReqResp(R: in out RESP) do 
select 
accept DRS1(...) do...end; 
or 
accept DRSi(XR: in XREC; Rl: in TRI1;...; 
. Ry: in TRy) do 
R.X := XR; 
PutRes(R.B, Rl,..., Ry); 
end DRSi; 
or 


accept DRSk(...) do...end; 
end select; 
end ReqResp; 
end loop; 
end SRH; 


Each time LCS requests a new response (entry 

ReqResp), SRH makes an arbitrary choice among pen- 
ding responses ready to be delivered for any oper- 
ation (entries DR$1...DR$k), whereupon the original 


transaction record and corresponding output results 
are copied into the RESP, to be transmitted back 
to the node from which that particular call ori- 
ginated. | 


The definition of a Server package SSi has 
the following form: 


package body SS$i is 
subtype SID is NATURAL range 1..Card(Conn) ; 


task type SGT; 
ST: array (SID) of SGT; -- surrogate tasks 


procedure GetArg(...) is ... end; 
procedure PutRes(...) is ... end; 


--- + body of SGT 
end SSi; 


The Surrogates for the operation PSi are introduced 
as an array of tasks, the range of which is set 

to the cardinality of the incoming connections 
(which would be the maximum number needed if 

every connected node did indeed call the operation 
in question). The operations GetArg and PutRes 
are presumably the inverses of PutArg and GetRes, 
which were present on the driver side. . 


Finally, each individual surrogate for PSi is 
defined as follows: 


task body SGT is 


XR: XREC; 

Al; TAL 

Ax: TAX; 

Rl: TRL; . 

Ry: TRy; 
begin 

loop 


SCH.RCSi(XR, Al,..., Ax); 
NodeSh.PSi(Al,..., Ax, Rl,..., Ry); 
SRH.DRSi (XR, Rl,..., Ry); 
end loop; 
end SGT; 


In a cyclic fashion they simply request a call 
from SCH, invoke the local operation provided by 
Node,,, and deliver the corresponding response 
(along with the original transaction record) to be 
dispatched by SRH. Once again, because the dis- 
patching is handled within a rendezvous, informa- 
tion is copied directly between the individual 
Surrogates and an incoming CALL or outgoing RESP. 


It should be noted that no special precautions 
are taken on the server side to ensure the basic 
property of the Remote Entry Call protocol (at most 
one call in progress to each operation from any 
given node); this is solely a concern on the driver 
side. The servers simply invoke the local opera- 
tions in question. If these have been specified 
as entries, then those calls will indeed be serv- 
iced sequentially; otherwise they will proceed con- 
currently. 
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What is of significance on the server side, 
however, is the fact that there are exactly as 
many Surrogates for each operation as there are 
Agents in total (distributed among the possible 
caller nodes). This property, referred to as load 
balaneing, is fundamental to the solutions devel- 
oped here, in that it ensures that this protocol 
does not require any additional storage capacity 
within the underlying communications medium. nor 
any other form of buffering than that provided by 
the Surrogates themselves. This same property also 
guarantees that the communications interface will 
never be unduly tied up (since there will always 
be an available Surrogate ready to proceed). 


4.2 The Remote Procedure Call 


In this section, we develop an alternative to 
the Remote Entry Call protocol, wherein we allow 
a (bounded) number of calls to the same operation 
to be in progress concurrently within a given 
caller node (while still maintaining the overall 
load balancing that characterized our first solu- 
tion). This somewhat more general strategy is 
described as a modification to the approach devel- 
oped initially. 


The point of departure for this strategy is 
to slightly extend the initial specification for 
the application as a whole: 


package Config is 


type NODE is (NNS1, NNS$2, ..., NNSn); 
type NSET is array (NODE) of BOOLEAN; 
subtype CONC is INTEGER range 0O.....; 
| -- Max Concurrency 
package NodeSl1 is ... end; 


package NodeSh is 
type OPER is (OPS1, OPS$2, ..., OPSK); 


type MPLX is array (OPER) of CONC; 
-- other type definitions ... 


Host: constant NODE := NNSh; 

Conn: constant NSET := (... => True, others 
=> False); 

Load: constant MPLX := ...; 


-- other constant declarations ... 


generic 
Site: in NODE; 
Usag. in MPLX; 
package Service is 
procedure P$1 (...); 
procedure PSk (...); 
end Service; 


end NodeSh; 


package Node$n is ... end; 


end Config; 


The changes are wholly concerned with this added 
(potential) concurrency: 


A subtype CONC is introduced, whereby the 
maximum degree of concurrency anywhere 
within the system is specified; 

- Within the package specifying each Node, a 
type MPLX is defined, values of which indi- 
cate a degree of concurrency on an opera- 
tion-by-operation basis; 

~ A constant load (of type MPLX) is defined 
for each Node,, whereby the limits on the 
overall concurrency (from all callers) are 
established for every such node; 

- An additional generic parameter Usag (of 

type MPLX) is introduced for the Service 

package, so that the degree of concurrency 
for individual caller nodes may be set upon 
subsequent instantiation. 


Minor modifications are also introduced into 
the body of the package Config, wherein the over- 
all communications conventions are established: 


with Medium; 
package body Config is 


subtype OPID is INTEGER range 0..... ; 
subtype RCID is CONC range 1..CONC'LAST; 


type XREC is record 
Orig, Dest: NODE; 
Code: OPID; 
Iden: RCID; 

end record; 


type BUFF is ... ; 

type XTYP is (XC, XR) 

type XMIT(T: XTYP) is record 
X: XREC; 
B: BUFF; 


end record; 
subtype CALL is XMIT(XC); 
subtype RESP is XMIT(XR); 


generic 
From, To: in NODE; 
with procedure Request (C: 
with procedure Deliver (R: 
package ChnDriver; 


in out CALL) ; 
in RESP); 


generic 
From: in NSET; 
To : in NODE; 


with procedure Request (R: 
with procedure Deliver (C: 
package ChnServer; 


in out RESP); 
in CALL); 


package body ChnDriver 


package body ChnServer is... use Medium; 


package body NodeSl is separate; 


package body NodeSn is 
end Config; 


separate; 


is -.. use Medium; ... end; 


27h end; 
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The changes are to define an additional subtype 
RCID, which will serve to identify a particular 
remote call originating from a given node (since 
the OpCode alone will no longer be sufficient for 
this purpose), and to add a new component Iden 
(of type RCID) to all transaction records. 


The only changes within the definitions of the 
separate nodes of the application would be to sui- 
tably set the generic parameter Usag upon each 
instantiation of the package Service: 


separate (Config) 
package body NodeSh is 


pragma SYSTEM(...); 
-~-- Specify local application modules: 


package AS$1 is 
procedure QS1(...); 
procedure QSf(...); 

end AS1 

package AS$m is 
procedure Q$1(...); 
procedure Q$q(...); 

end ASm; 


-- Local (re)definition of services: 


procedure PS$i(...) renames ASa.Q$b; 


~- Support services called remotely: 


package Support; 
package body Support is -- Server side of Protocol 


eee? 


end Support; 


package body Service is -- Driver side of Protocol 


end Service; 
-- Provide services needed locally: 


package NodeSu is new Config.NodeSu.Service 
(Site => Host, Usag => ...); 

package NodeSv is new Config.NodeSv.Service 
(Site => Host, Usag => ...); 


package body AS1 is separate; 
package body ASm is separate; 
end Nodesh; 

4.2.1. The Driver Side. The changes on the 
driver side in going from the Remote Entry Call to 
the Remote Procedure Call are concerned with keep- 
ing track of the identity of calls in progress. 

At the first level, this involves adding and addi- 
tional ID parameter to the DCSi entries of the 
Driver Call Handler (DCH), and of introducing a 
Post Response procedure (PR) to each of the Drier 
packages DS$1...DSk: 


package body Service is | 
-~- Driver Side, defined in Config.Node$h: 


task DCH is 
entry RegCall(C: in out CALL); 
entry DCS1(...); 


entry DC$i(ID: in RCID; Al: in TAl; ...; 


Ax: in TAx); 


eee 


entry DCSk(...); 
end; 


task DRH is 

entry DelResp(R: in RESP); 

entry RRS$1(...); 

entry RRSi(R1: out TR1; ...; Ry: out TRy); 
- entry RRSk(...); 
end; 


package LCD is new ChnDriver ( 
From => Site, To => Host, 
Request => DCH.ReqCall, 
Deliver => DRH.DelResp) ; 


package DS$l is ... end; 
package DSi is 
procedure P(Al: in TAl;...;Ax: in TAx; 


Rl: out TR1;...;Ry: out TRy) 


procedure PutArg(B: in out BUFF; 


Al: in TAl: ...; Ax: in TAx); 


procedure GetRes(B: in BUFF; Rl: out TR1;...; 
Ry: out TRy); 


procedure PR(ID: in RCID) 
end DSi; 


package DS$k is .. end; 
procedure P$1 (...) renames DS1.P; 
procedure PS$k (...) renames DS$k.P; 


--. + bodies of DCH, DRH, DS1, ..., DSk 


end Service; 


accept DCSk(...) do ... end; 
. end select; 
end RegqCall; 
end loop; 
end DCH; 


The corresponding modifications to DRH involve 
its passing that identity to the appropriate PR 
procedure prior to accepting a request to dispose 
of each incoming response: 


task body DRH is 
begin — 
loop 
accept DelResp(R: in RESP) do 
case OPER'VAL(R.X.Code) is 
when OPS] => ...; 
when OPSi => 
D$i.PR(R.X.Iden); 
accept RRS$i(R1l: out TRI,..., 
Ry: out TRy) do 
D$i.GetRes(R.B, R1l,..., Ry); 
end RRSi; 
when OPSk => ...; 
end case; 
end DelResp; 
end loop; 
end DRH; 


Within a Driver package D$i, the modifications 
consist primarily of introducing a multiplicity 
of Agents for the same operation (whereas there 
was only one heretofore). As shown on the next 
page, this is accomplished by defining an array of 
agent tasks (AT), the range of which is esta- 
blished by the Usag generic parameters. Thus, the 
index in this array (of type AID) will serve to 
uniquely identify a particular call-in-progress 
for the operation P$i. At the same time, addi- 
tional entries have to be provided for the AGT 
task: these are Init (whereby an Agent acquires 
its own identity) and Done (whereby it may be no- 
tified that the response for the call it is car- 
rying out has been received). The procedure PR 
is essentially a call to this latter entry. A 
further task, the Agent Manager (AM) is now needed 
to establish the initial correspondence between 


The definition of DCH is then modified to store the original call (from some application process) 
the identity of each call as part of the transac- 


tion record which it forwards: 


task body DCH is 
begin 
loop 
accept ReqCall(C: in out CALL) do 
select 
accept DC$1(...) do ... end; 
or 
accept DCSi(ID:in RCID; Al:in TA1;...; 
; Ax:in TAx) 
C.X.Code. := OPER'POS (OPSi); 
C.X.Iden := ID; 
D$i.PutArg(C.B, Al,..., Ax); 
end DC$i; 


or 


do 


and the particular agent which will perform that 
transaction. This correspondence is created by 
the procedure P, which is called (concurrently) by 
every application process seeking to invoke the 
remote operation P$i. 


package body DSi is 
subtype AID is RCID range 1..Usag(OP$i); 
task type AGT is 
entry Init(A: in AID); . 
entry Exec(Al: in TA1;...;Ax: in TAx; 
Rl: out TR1;...; Ry: out TRy); 
entry Done; a 


end; 


AT: array(AID) of AGT; 
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task AM is 
entry Ready(A: out AID); 
entry Avail(ID: in AID); 
end; 


procedure P(Al: in TAl;...;Ax: in TAx; 
Rl: out TR1l;...; Ry: out TRy) is 
A: AID; 
begin 
AM.Ready (A); 
AT(A) .Exec(Al,...,Ax, Rl,...,Ry); 
end; 


procedure PutArg(...) is ... end; 
procedure GetRes(...) is ... end; 


procedure PR(ID: in RCID) is 
begin 

AT(AID; (ID) ) .Done; 
end; 


--- + bodies of AGT, AM 
end DSi; 


The initialization and actual allocation of agents 
is handled by the Agent Manager: 


task body AM is 
begin 
for A in AID loop 
AT(A).Init (A); 
end Loop; 
-- main cycle: 
loop 
accept Ready(A: out AID) do 
accept Avail(ID: in AID) do 
A := ID; 
end; 
end; 
end loop; 
end AM; 


Each of the agent tasks of the array AT is then 
defined as follows: 


task body AGT is 


ID: AID; 
begin 
accept Init (A: in AID) do 
ED y= As 
end; 
-- main cycle: 
loop 


AM.Avail (ID) ; 
accept Exec(Al:in TA1;...;Ax:in TAx; 
Rl:out TR1;...;Ry:out TRy) do 

DCH.DCSi(ID, Al,..., Ax); 
accept Done; 
DRH.RRSi(R1,..., Ry); 

end Exec; 

end Loop; 
end AGT; 


After initialization an Agent enters its main cycle, 
wherein it first makes itself available to AM prior 
to accepting the resultant call via its entry Exec. 
Within the corresponding rendezvous, it delivers 
its own identity to SCH along with the arguments 
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for the call in progress, it then awaits notifica- 
tion (via the entry Done) that the response for 
that particular call has been received before pro- 
ceeding to request the results on behalf of the 
original caller. 


4.2.2. The Server Side. In passing from the 
Remote Entry Call to the Remote Procedure Call 
protocol, essentially no modifications are 
required on the server side (since this latter 
already provided for some degree of concurrency, 
insofar as it had to handle incoming calls from 
more than one caller node). The only provision 
that must be made is to possibly increase the num- 
ber of Surrogates for each operation PS$i, which 
would be specified within the corresponding Server 
package S$i as follows: 

subtype SID is CONC range 1..Load (OPSi); 
thereby fixing the number of elements in the array 
of surrogate tasks. This will presumably preserve 
the overall load balancing (number of Surrogates = 
total number of Agents, for each operation Pi) upon 
which both of the protocols developed in this sec- 
tion have been based. 


6. Conclusion 


This paper has addressed the problem of pro- 
gramming distributed applications in Ada and out- 
lined a first approach in this area. Essentially 
two aspects have been considered: the provision 
of a suitable compile-time framework for defining 
such applications in the first place (which was 
achieved by exploiting the possibilities of the 
separate compilation facilities in Ada); and the 
support of a suitable "interprocessor procedure 
call" protocol, whereby the application itself 
could then be programmed without further regard to 
the distributed nature of the underlying hardware 
configuration (a capability which was defined in 
terms of the multi-tasking facilities of Ada). 
Several such protocols were in fact developed here, 
beginning with the relatively simple Remote Entry 
Call, which was then extended to yield the Remote 
Procedure Call strategy. In [4] we further exten- 
ded this approach so as to take into account the 
unreliability of the transmission medium in ques- 
tion, while still assuming that the nodes within 
the overall configuration were perfectly reliable. 
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Abstract -- A procedural single assignment 
language, SALAD, is presented, and its implemen- 
tation on a distributed, multicomputer system is 
discussed. A procedure is executed not by a 
Single task, but by a collection of cooperating 
tasks (threads of control) that share the pro- 
cedure's activation record and synchronize with 
event variables and semaphores. Procedure calls 
and returns are handled with message passing, 
permitting the called procedures to be executed 
On remote machines. SALAD includes state-main- 
taining objects, e.g. queues, which violate the 
spirit of single assignment lanquages, but pro- 
vide more usual multitasking facilities. Not 
only is a SALAD program to be able to run distri- 
buted over a computer network, but the compiler 
itself is to be able to execute on such a system. 
The compiler tries to optimize both the code for 
SALAD procedures and the distribution of the code 
over the network. 


Introduction 


The proliferation of distributed computing 
systems has intensified the software crisis. The 
development of distributed hardware has far out- 
stripped the development of the necessary soft- 
ware. We are primarily concerned with systems 
composed of microcomputers connected in a net- 
work. There is a strong need for compilers which 
can both run on a distributed system and generate 
code for that same system. The development of 
compilers for such systems presents two novel 
problems: distributed compiler orqanization and 
distributed code generation. Since different 
components of the compiler will run on separate 
microcomputers it must be organized into small 
modules which can function independently and ex- 
change information only by messages. This com- 
piler must also be capable of generating distri- 
buted code. First it must partition the code 
into clusters of tasks small enough to fit on the 
separate computers of the distributed system. 
Then run time routines are added to control the 
run time system, to handle error conditions, and 
to make each cluster capable of standing alone. 
Finally, the compiler inserts message-passing 
primitives in each cluster to provide the 
exchange of values between the clusters at run 
time. The problem of automatic partitioning jis 
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a central issue in the design of these new com- 
pilers, but the problems of partitioning conven- 
tional general-purpose programming languages are 
tremendously complex. Single-assignment lan- 
guages (languages in which no variable is assign- 
ed a value more than once) seem to be much 
easier to handle. This paper describes the de- 
Sign and the implementation of a simple but non- 
trivial single assignment language (SALAD). The 
compiler is organized as a pipeline of small 
modules each of which can reside on a separate 
microcomputer; it includes a crude partitioning 
module which divides the intermediate code into 
self-contained clusters which are then converted 
into separate load modules for separate com- 
puters. 


There is a strong relationship between our 
work and current work on dataflow computers 
C1, 2]. In fact, we are motivated by the belief 
that it is possible to obtain the advantages of 
data-flow architecture without the expense of 
specialized hardware. There are important dif- 
ferences, however, in our methods. We have not 
simply programmed microcomputers to behave like 
data-flow computer components. Instead we have 
encoded data-flow operations as multiple com- 
municating processes. 


Our model of a distributed computing system 
is a collection of many independent computers 
with no shared memory, so that all communication 
is by means of messages only. Why assume that 


there is no shared memory? While systems exist 


with multiple processors accessing a common 
memory, systems without shared memory are easier 
and cheaper to build, since they can be produced 
by adding communication channels to existing 
machines. When those with shared memory systems 
want to link them together, they will be sub- 
ject to these constraints as well. Furthermore, 
compilers designed for systems without common 
memory can be made to run on shared memory 
systems, but the reverse is not true. 


Our interest in the twin problems of dis- 
tributed compiler organization and the genera- 
tion of distributed code began with the TECHNEC 
project, a very successful project which was 
funded by the National Science Foundation 
(NSF-MCS76-01310). TECHNEC, the Illinois 
Institute of Technology Network Computer, is a 
ring network of LSI/11's (3, 4]. It was design- 
ed to support Greene's experiments in heuristic 
control [5]. The first step was a distributed ; 
operating system [6, 7]. We have also designed | 
and implemented a demon language for TECHNEC, | 
but the current implementation is not truly a 


distributed compiler, it is a cross-compiler run- 
ning on the PRIME 400 but generating code for 
TECHNEC. Demons do, however, present particu- 
larly exciting problems in the design of the run 
time environment. 


Our first true distributed compiler was the 
DYNAMO compiler [8] and [9]. This language was 
chosen because we needed a continuous simulation 
language for work on robotics and we were fasci- 
nated with the challenge of simulating parallel 
processes on a network of parallel processors. 
This compiler was designed from the beginning to 
run on our distributed system and to generate 
automatically partitioned code for that system. 


We have experimented with four different 
partitioning algorithms for this compiler [10, 
11]. DYNAMO is a nonprocedural language with no 
explicit control structures; this makes it rela- 
tively easy to partition. It can even be viewed 
as a single assignment language, although it does 
not resemble pure LISP and those applicative 
languages which are usually called single assign- 
ment lanquages [12, 13]. We felt that the next 
step toward our goal of eventually developing 
mechanisms for conventional programming lanquages 
ought to be a compiler for a language that com- 
bines the constraints of a sinale assignment 
lanaquage with procedures, explicit control struc- 
tures, and at least some multiple data structures. 
Since we did not know of such a language we de- 
cided to design one ourselves; the result is 
SALAD. 


Description of the Language SALAD 


A program consists of a collection of pro- 
cedures. (See Figure 1 for grammar.) A proce- 
dure declaration consists of:a header line, zero 
or more declaration lines, one or more command 
lines, and-an END line. The procedure header is 
of the form 

PROC outputs = procedure-name inputs 
where either outputs or inputs may be a single 
identifier or a list of identifiers in 
parentheses. 


The declarations in SALAD are optional. If 
an identifier is not declared, it may be of any 
type, and its type may be different on different 
executions of the procedure. If a name is de- 
clared, the code will check that the type of the 
value assigned to it is correct at run time. The 
form of a declaration is an identifier or a list 
of identifiers in parentheses followed by a colon 
followed by the name of the type they are being 
declared to be. 


There are three primitive types in SALAD: 
integer, real and Boolean. There are two pure 
structured types: strinas and tuples. A string 
is a sequence of characters. Once created, it 
cannot be modified. There is no theoretical limit 
on the length of a character string. 
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>:= prochd declare* command+ END newline 
prochd ::= PROC lhs = procid Ths newline 
declare ::= Ths 


proc 


: type newline 


type is one of INTEGER, REAL, BOOLEAN, TUPLE, 
STRING, FILE, QUEUE, ANY 
command ::= Ths = rhs newline 
Ths ::= id 
Ths ::= ( idlist ) 
idlist ::= id 
idlist ::= id , idlist 
rhs ::= e3 
rhs ::= IF id THEN e3 ELSE e3 
e3 ::= e2 
e3 ::= op e2 
e2 ::= el 
e2 ::= ( ellist ) 
e2 ::= ( 
ellist ::= el 
ellist ::= el , ellist 
el ::= id 
el ::= constant 
Fiqure 1. SALAD Grammar. 
A tuple is a sequence of values. Each value 
in a tuple may be of any type. As with strings, 


once created, a tuple cannot be modified. Tuples 


play a central role in the SALAD language. 


There are two kinds of executable statements 


in SALAD: simple assianments and conditional 
assignments. Simple assignments have the form 
lhs = rhs 


The left hand side, lhs, can be either a single 
identifier or a list of identifiers in paren- 
theses. If the’lhs is a sinale identifier, then 
the value produced by the rhs may be of any type. 
If, however, the lhs is of the form (idl, id2, 
.», idn) then the value of the rhs must be a 
tuple of length n. Each element of the tuple is 
oe to the corresponding identifier in the 
hs. 


If the rhs is a single identifier, then its 
value is used. If the rhs is a list of identi- 
fiers (idl, id2, ..., idm), then the value of the 
rhs is a tuple of lenath m with the value of 
identifier idj in position j. 


The right hand side could also be an opera- 
tor or function applied to either a single iden- 
tifier or a list of identifiers in parentheses. 
The single identifier form, F A, causes the 
function F to be applied to A's value. The 
form F (Al, A2, ..., An) causes function F to 
be applied to a tuple with the values of the 
identifiers Al through An. 


A conditional assignment has the form 


ths = IF b THEN rhs ELSE rhs2. 


where b is an identifier that will have a 
Boolean value at run time, and lhs and rhsl and 
rhs2 have the forms discussed immediately above 
for ie assignment statements. 


= IF B THEN c ELSE d 


ee B be a Boolean-valued identifier. If 
B is true, the statement behaves as if it were 


a=c 


If, however, B is false, then the statement 
behaves as if it were written 


a=d 


The simple and conditional assignment state- 
ments are required to obey the single assignment 
nature of the language. An identifier may appear 
only once in a lhs within a procedure. An iden- 
tifier in the inputs section of a procedure 
header may not appear in the lhs of any assign- 
ment in the procedure. 


Like conditional assignments, procedure 
calls may be defined by substitution rules. The 
definition of procedure calls in SALAD is very 
similar to the copy rule for Algol 60 procedures. 
Given the procedure definition 


PROC plhs = procid prhs 
body 
END 

and the call of the procedure 


clhs = procid crhs 
the call behaves as if it had been written 
prhs' = crhs 
body ' 
clhs = plhs' 


where the prime (') indicates that all the iden- 
tifiers are renamed uniquely to avoid conflict 
with the identifiers in use at the place of call. 


There are two other data types in SALAD that 
have not been mentioned before. They violate the 
spirit of single assignment languages in that 
they maintain an internal state, can be modified, 
and are not pure values. 


One of the two types is FILE. Since files 
encapsulate the interface to the outside, state- 
maintaining world, they must be forgiven for be- 
ing that way themselves. The other type is QUEUE. 
Queues are the objects used for synchronization. 
Any kind of object may be added to a queue. An 
attempt to remove an object from an empty queue 
will cause a delay until an object becomes pre- 
sent. There is no such thing as a full queue. 
Queues are generally handled in FIFO order, but 
simultaneous attempts to add or remove items from 
a queue will be serviced in an unspecified order. 


The operations on queues are as follows: 
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sists of the programs for the procedures located 


= QUEUE ( ) 
ql = PUT(q,val) 
(q2, vall) = GET ql 


The operation q = QUEUE ( ) creates a new 

queue object and return a pointer to it in q. 
The operation ql = PUT(q, val) puts the value of 
val in the queue pointed to by q and returns a 
new pointer to q inal. 

The operation (q2, vali) = GET ql removes an item 
from the queue pointed to by ql and returns that 
value as vall. It also returns another pointer 
to the queue in q2. 

The reason for returning new pointers to a queue 
is to permit sequencing of queue operations in the 
calling program. For example, 


= QUEUE ( ) 
ql = PUT (q,a) 
q2 = PUT (ql,b) 
(q3, c) = GET a2 


(q4, d) = GET a3 


will accomplish much the same as 


= QUEUE ( ) 
(ql, q2, q3, q4) = (q, q, q, q) 
(c, d) = (a, b) 
However, 
= QUEUE ( ) 
ql = PUT (q,a) 
q2 = PUT(q,b) 
(q3, c) = GET q 
(q4, d) = GET q 


will not necessarily accomplish the Same thing. 
Sometimes it will behave as the example above; 
sometimes it will do the assignment 


(c,d)=(b,a) 


Calls and Messages 


Since the network hardware we envision has 
no shared memory, the implementation must use 
message passing to pass data and coordinate the 
execution. Procedures are called by sending a 
call message. The results of a procedure call 
are returned in a return message. If a procedure 
needs to examine a tuple located on another ma- 
chine, it sends a message requesting a copy of 
the tuple. Operations on state-maintaining ob- 
jects, files and queues, are handled via a mes- 
sage to the computer where the object is located. 


Node Structure 
The software structure on each computer con- 
on that computer; tables used by the system; 


input queues for receipt of messages from the 
computers connected to this one; output queues to 


the connected computers; an "available operations 
queue" containing call messages for procedures 
that could be involked here; and a heap, or dyna- 
mic storage area, that contains tuples, strings, 
queues, and local storage for active procedures. 


Activation Records and Threads of Control 


An "activation record" contains the local 
storage for a procedure. In a conventional lan- 
guage, a single process, or task, would have a 
stack of activation records. Every procedure 
call would push an activation record on the stack. 
Every return would pop one off. Our implementa- 
tion does just the reverse. There are not multi- 
ple activation records per task. There are 
multiple tasks, which we call threads, per acti- 
vation record. 


Each thread is associated with a thread 
control block which contains only 1) a pointer to 
the activation record the thread is associated 
with, 2) the address of the next instruction the 
thread is to execute, and 3) a link field so the 
thread control block can be placed on queues. 


The system in each separate computer in the 
network maintains a run queue of threads. A 
piece of code, the dispatcher, removes the first 
thread control block from the run queue, loads a 
register with the activation record pointer, and 
jumps to the next instruction the thread is to 
execute. 


A simple form of coordination between 
threads uses “event variables." An event vari- 
able is initialized to require a particular nun- 
ber of "signals" before the event occurs. Only 
a single thread may wait on an event variable. 
If the event has not yet occurred, the waiting 
thread is suspended until it does occur. If the 
event has already occurred, the thread continues 
executing. Another thread may signal the event 
variable. If the signal is the last one required 
before the event occurs, and another thread is 
waiting for the event, the waiting thread is 
linked on the run queue. 


We have implemented the synchronization 
primitives in PDP-11 assembly lanquage. To wait 
On an event variable requires at most five 
instructions, whether or not the thread executing 
the wait must be suspended. To signal an event 
variable without waking up a thread requires only 
two instructions. If a signal wakes up a thread, 
that thread must be linked on the run queue. 
Linking a thread on the run queue requires less 
than a dozen instructions. 


The system also provides semaphores, which 
in addition to the usual synchronization and 
mutual exclusion functions, are used to permit 
control of the degree of concurrency at run time. 
This is mentioned again below in the section on 


optimizations the compiler can perform. 
See Figure 2 
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THREAD CONTROL BLOCK 

LNK : link field 

PC : program counter field 
FP : frame pointer field 


EVENT VARIABLE 
CNT : count field 
THRP : thread pointer (to THCB of Waiting Thread) 


READY QUEUE 

ROLCK : lock byte on ready queue (init 1) 
RQIST : pointer to head of ready queue 
ROLST : pointer to end of ready queue 


READY LIST MANIPULATION 


dispatch: -- Just a normal label 
while RQIST = NULL do diddle 
seize rq 
if RQIST = NULL then 


{ release rq 
goto dispatch} 


else 
{T := ne 
RQIST := RQIST@.LNK 
release rq 
FP_REG := T@.FP 
PCA} 


JUMP TQ. 
procedure ready 1 (t) = 
= NULL 


{ t@.LNK : 
seize rq 
if RQIST = NULL then 
{RQIST :=t 
RQLST :=t} 
else 
{RQLST@.LNK : 
RQLST := t} 
release rq} 
return 
end readyl 


SIGNAL OPERATION ON EVENT VARIABLE 
signalevent s 
-- is translated into 
decr s.CNT 
if zero then 
ready1(s.THRP) 


WAIT OPERATION ON EVENT VARIABLE s BY THREAD t 
waitevent s 
-- by a thread with thread with 
-- thread control block t 
-- is translated into 

t.PC := @L 

s.THRP :=@t 

decr s.CNT 

if positive then 

goto dispatch 
Ls 


INITIALIZE EVENT VARIABLE 
initevent s,c 
-- is translated into 
s.CNT := ctl 


Figure 2. Threads and Events 


Distributed Garbage Collection 


Tuples, strinas, queues and files are dyna- 
mically created during the course of the program. 
Their bodies occupy memory on the heap on the 


computer they were created on. When they are no 
longer needed, the storage they occupy must be 
reclaimed for other uses. Tuples and strings 


cannot be created with cycles of containment, e.g. 


if tuple A is created with tuple B as a con- 
ponent, then since B was created first, it can- 
not contain A as a component. Moreover, B 
cannot be modified to contain A. At most, a 
copy, C, of B can be created with a component 
changed to be A. 


Thus, the containment graphs for tuples are 
directed, acyclic graphs, and reference count 
storage reclamation is adequate. The state-main- 
taining objects, files and queues, might cause 
problems. 


Files do not, in fact, cause problems since 
neither files nor queues nor tuples containing 
files or queues may be written into them. Queues 
may cause problems; a queue can be placed into 
itself, e.q.: 


= QUEUE ( ) 


ql = PUT (q,q) 


It is, therefore, possible to creat sireiiarty 
linked, inaccessible structures which will not be 
reclaimed with a reference count scheme. 


Such structures are likely to be rare. So 
we are considering implementing a reference count 
storage management algorithm. We do have, in 
addition, a°’full garbage collection algorithm 
that will mark all accessible structures and re- 
claim those that are inaccessible. It works in 
two phases: 1) it marks all accessible struc- 
tures Dy naving each computer mark those that are 
accessible locally and send messages to the com- 
puters containing those that are remote; after 
all computers have finished the mark phase. 

2) it has each computer reclaim the unused stor- 
age in its own heap. 


We have no plans to make the garbage collec- 
tion algorithm run concurrently with normal 
processing. There are thoughts that a single 
computer can do some local garbage collection 
independently of the others. Objects on the heap 
can be marked when pointers to them are sent to 
other computers in messages. Unmarked objects 
are only pointed to locally, if at all, and can 
be collected by the computer on which they are 
contained. 


Structure of the Compiler 


The compiler for SALAD is composed of four 
sections. The first section translates from the 
source language into intermediate code. The se- 
cond section optimizes the intermediate code and 
translates procedures into parallel cooperatina 
threads of control. The third section allocates 
the procedures to separate computers. The fourth 
section converts into assembly-like code which it 
optimizes and assembles. Each section is com- 
posed of several phases which can be run as sepa- 
rate passes or, in some cases, as a pipeline. 
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Code aptinizatton and allocation are described - 
in more detail below. 


Translating Procedures into Parallel Threads 


When control enters a procedure it is exe- 
cuting a single, main thread associated with the 
procedure. The main thread creates the other 


threads in the procedure to execute concurrently 


with it. See Figure 3a for a collection of 
procedures and Figure 3b for an example of the 
code that could be generated for that collection. 


Figure 3a below: 


PROC C= FN , 
C = Fl (1,1,N) | 


F1(1,0,M) | 
J) | 
1,M) 

THEN I ELSE F2(1I,d,M) 


me —1Iv 


Figure 3a. SALAD code for factorial Function F. 

The compiler must partition the operations 
in a procedure into a collection of threads. 
There are two main rules for placing operations 
into threads: 1) Two operations may be placed 
in the same thread only if one is dependent on 
data from the other; 2) The operations must be 
placed in the thread in the order they must be 
executed. The first rule is to prevent one 
operation that could be executed from being de- 
layed waiting for completion of an operation 
that does not have to precede it. The second 
rule is obvious. 


If an operation, A, in one thread requires 
as input a name computed by a operation, B, in 
another thread, the first thread must wait on a 
event variable, EA, before executing A, and the 
other thread must Signal EA after Finishing 
operation B. 


The compiler performs the following optimi- 
zations on threads where applicable: 


1) Recursion removal -- a tail-end recur- 
sive call to the procedure that includes the call 
can sometimes be replaced with an assignment to 
the input parameters and a jump back to the 
beginning. 


2) Code incorporation -- a procedure called 
in only one place may be incorporated into the 
place of the call. Recursion removal, by elim- 
inating a place of call may make more code incor- 
poration possible, which may in turn permit fur- 
ther recursion removal. Also, small, non-recur- 


sive procedures may be incorporated into several 
places of call. | 


3) Within the constraints mentioned above 
about the placement of operations in threads of 
control, the compiler tries to minimize the 
number of threads generated for a procedure. The 
constraint that two operations may be placed in 
the same thread only if there is a data depend- 
ency between them makes the minimization of the 
number of the threads. an NP-hard problem: it 
can be shown equivalent to graph coloring. 


4) It will often be possible to eliminate 
some event variables and signals. In particular, 
an operation A_ that defines a value used by 
Operation B need not signal an event for B_ if 
there is an operation C that uses a value sup- 
Plied by A and supplies data to B. 


5) For those procedures where it is permis- 
Sible, the compiler will generate code that will 
choose at run. time whether to execute only one 
Operation in the procedure at a time or to exe- 
cute with as much concurrency as possible. We 
hope this code will keep the system from becoming 
Swamped by concurrent operations. The trick is 
to protect. the activation record with a semaphore 
and let the threads compete for it. See Figure 4 
for a picture of the desired behaviour. 


6) "Sending off" tail-end calls -- Some- 
times it is possible for a procedure call as the 
last operation of a procedure A_ to tell the 
called procedure to return its values to the 
caller of A; not to A itself. After sending 
off this call, A's activation record may be 
deleted. 


The Partitioning Module 


The purpose of the partitioning module is to 
specify how the code should be partitioned and 
assigned to the different computers of the net- 
work in order to achieve load balancing, while 
minimizing the communication overhead. The out- 
put from the partitioning module includes not 
Only code clusters but enough information about 
runtime data flow so that the code generation 
routines can produce runtime modules capable of 
Standing alone on separate computers and communi- 
cating via messages. 


The basic requirements for partitioning 
module design are considered to be: 


1. The module can be incorporated as an 
integral part of a pipelined compiler i.e. it 
should accept information concerning source pro- 
gram in the form of a stream of messages each of 
them containing information regarding one source 
statement. The messages arrive one at a time. 
The format of the messages and their contents 
should be compatible with the output of the front 
end of the compiler. The last stage of the par- 
titioning module should produce groups of state- 
ments of intermediate language (partially com- 
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piled code) compatible with the input language of 
the back end stage(s) of the compiler. 


2. The partitioning module can run on the 
same network computer for which the distributed 
compiler is designed. So the same limitations on 
the size of every compiler phase apply to the 
partitioning module also. However, the parti- 
tioning module may consist of a number of phases 
distributed over the network and cooperating to- 
gether to perform the function of that module. 
In this case, the general requirements for dis- 
tributed software apply also to the distributed 
partitioning e.g. minimizing the communication 
overhead and balancing the network load. 


Figure 3b below: 


; code for F 
Fmain: send off_cal] F1(1,1,N) 
terminate 
:code for Fl's main thread 
Fimain: Tl = +(I,d) 
Bi = GT(T1,M) 
if not Bl goto L1 
return I 
oe (X,5Y,Z) = (I,0,M) 
;code for F2's main thread, incorporated into Fl 
F2main: Choose concurrency CS 


sinitialize semaphore CS 
; to 1 or infinity 
initevent el,1 
initevent e2,1 
fork F2T2, F2ThCB2 
waitsema CS 
T2 = *(Y,2) 
signalevent el 
T4 = F1(X,T2,Z) 
signalsema CS 
waitevent e2 
waitsema CS 


F =*(T4,T5) 

return —E 
;code for F2's dther thread, T2 
F2T2: waitsema CS 


T3 = +(X,Y) 
signalsema CS 
waitevent el 
waitsema CS 
T5 = F1(T3,T2,Z) 
signalevent e2 
signalsema CS 
die 
Translation of code for factorial into 
optimized pseudo-code. 


Figure 3b. 
In our desian, the partitioning module (PM) 

is composed of three phases:. 

Phase I: Data structure builder 

Phase II: Partitioner 

Phase III: Allocator 


The data structure building phase serves as 
a functioning part of the front end pipe during 


compile time, f.e. it processes one statement at 
atime. It builds a standard representation of 
programs using such things as data dependency 
among statements. The kind of data structure 
built during this phase will be discussed in de- 
tail in the following paragraphs. Transforming 
known programming constructs to this standard 
representation is a major part of this research 
and will be discussed in detail in the next 
sections. The partitioner and the allocator deal 
with the representation of the whole program in 
two separate passes. They must collect global 
information about dependency among statements and 
interactions among various parts of the program 
in order to segment that program and insert com- 
munication primitives in it. For example, when 
a program is partitioned into several nodes of 
the system, a value may be needed in a node other 
than the one in which it is calculated. The 
allocator must detect such a situation and pro- 
vide for the sending and receiving of the re- 
quired value. Since a node will probably be 
receiving values for many variables, the name of 
the variable must be included with the value so 
that the receiving node can identify it. The 
data structure builder stores the program repre- 
sentation on a file for the partitioner to oper- 
ate on it in a new pass. The graphs which are 
used as standard representation for programs are 
described elsewhere [14, 15]. | 


The partitioning problem can be divided into 
two basic strategies: verticle vs. horizontal 
partitioning. Imagine you have program listing 
and you draw horizontal lines on it. All the 
code between a pair of lines is placed in the 
same computer. This we call horizontal parti- 
tioning. , 


The easiest place to draw these lines is at 
subroutine boundaries. An entire subroutine is 
placed in a single computer. A subroutine is 
called by the arrival of a message providing its 
parameters, and it sends back a message when it's 
done. Since we assume parallelism is provided, 
or at least permitted, by the language being 
implemented, there may be several calls to the 
subroutine concurrently. Thus the activation 
record for the subroutine is allocated on a heap, 
rather than a stack, because the calls will not 
obey a strict LIFO discipline. 


If the horizontal lines cut through the 
midst of subroutines, then when the flow of con- 
trol reaches such a cut the activation record 
must be sent to the computer containing the next 
section of code. Note that addresses must have a 
computer name associated with them, and fetches 
or stores of anything other than components of 
the activation record require (potentially) mes- 
sage passing. -Note also that an activation re- 
cord cannot be sent to another computer while 
there is a reference to one of its components 
outstanding. 


As an alternative to horizontal partitioning, 


imagine you take a program listing and draw 
vertical lines on it making several columns be- 
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Side the program. Each column represents a dif- 
ferent computer. Beside the statements you make 
check marks in one or more of the columns. A 
check mark in a column beside a statement indi- 
cates that the statement is to be placed in the 
computer the column represents. The statements 
of the program have been partitioned among the 
several computers so that the computers work in 
parallel on the program, sending messages to each 
other when a value is computed in one that is 
needed in the other. Some statements may be re- 
presented in all of the computers, e.g. loop 
control. Each computer contains a part of the 
activation record of the routine being executed. 
We call this vertical partitioning. 


The partitioning heuristics developed for 
the DYNAMO compiler assigned each statement to 
precisely one computer. We noticed that in de- 
signing the DYNAMO compiler as a large distri- 
buted program we found that the high communication 
overhead made it more efficient to repeat some 
portions of the code in several different com- 
puters. 


For the SALAD compiler we have developed a 
simple divide-and conquer style algorithm, where 
a procedure divides a problem into several small- 
er problems and calls itself recursively and in 
parallel for each. To get any speed-up from this, 
copies of the procedure must be located in 
several computers. 


Our past work on partitioning has assumed 
that all partitioning should be done at compile 
time. Now that we have copies of some procedures 
in several computers , it seems that we may be 
able to baiance the jioad better by delaying wit 
run time the decision of where to execute a 
particular procedure call. 
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Future Plans 


The obvious next step is further experiments 
with the partitioning of SALAD. We would like to 
write at least some part of the SALAD compiler in 
SALAD and compare the compiler output with what 
we have done by hand. In the long run more 
theoretical work on the partitioning is certainly 
necessary. In the next year we hope to design a 
distributed partitioning compiler for a subset of 
Pascal. 
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Curved Vines -- density at which system chooses sequential execution 
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Procedure call tree resulting from dynamic choice of degree 
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Summary 


The Control Data Corporation STAR-100 is a 
very-high performance vector processor[1]. A 
language known as SL/1 [2] that is oriented to 
scientific applications programming and which 
allows good program structure was designed and 
implemented by the authors for the STAR-100, and 
is now being used for many applications. SL/1l is 
also used with the CDC CYBER-203 but the work 
reported here was done using the STAR-100. In 
this paper we discuss the optimizations performed 
by the SL/1 compiler and report a series of 
measurements of. the effects of these 
optimizations. The advent of vector processors 
and vector oriented languages such as_ SL/1 
produces a new environment for. scientific 
computation. Programs written for vector 
computers will be sufficiently different from 
their scalar counterparts that the effects of 
optimization in a compiler may be different. The 
primary optimizations of interest in the SL/1 
compiler are common subexpression elimination, the 
movement of invariant code out of loops, and the 
elimination of unecessary vector temporaries. In 
order to get some information about the effect of 
optimizing programs written in a vector language, 
the performance of the optimizer in the SL/1 
compiler was measured. 

There are two hardware characteristics of the 
STAR-100 which are of importance in optimization. 
First, the hardware supports vector instructions 
with vector lengths between zero and 65,535, and 
the execution time of a vector instruction is 


proportional to its length after an initial 
start-up delay. For floating point addition, the 


longest vector instruction requires approximately 
one and one third milliseconds while the’ shortest 
requires only approximately three microseconds; a 
ratio of about 400 to _ one. Under ideal 
circumstances, a scalar floating point addition 
requires only 0.16 microseconds; a ratio of almost 
10,000 to one compared to the longest vector 
instruction but only about 20 to one compared _ to 
the shortest. These ratios are important because 
the optimization techniques to be discussed are 
only applied to scalar operations. Vector 
operations are always included in SL/1l programs 
explicitly by programmers and there is usually 
nothing redundant that can be removed. Similarly, 
vector instructions are rarely inside loops in 
which they are invariant. } : 

The second hardware characteristic of 
importance is the set of 256 general purpose 
registers. Variables which are used frequently 
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can be stored in registers permanently [3], and 
the values of common subexpressions: which appear 
in separate parts of a program can reside in 
registers between uses. 

The SL/1 language structure is modelled after 
SIMPL T [4]. Variables can be declared as 
scalars, vectors, or arrays. Arrays of scalars 
are not allowed and all array elements must be 
vectors. A matrix is therefore represented by a 
one dimensional array of vectors, and for a given 
matrix, the user may FUPETDECE these vectors as 
rows or columns. 

As well as basic vector arithmetic, the 
STAR-100 . hardware provides a_ variety of 
sophisticated macro operations. For example, 
forming the inner product of two vectors is a 
single machine instruction, as is the evaluation 
of a polynomial for a vector of coefficients and a 
vector of arguments. All of these macro 
instructions are available in SL/1l as special 
operators which can be used freely in building 
expressions. The compiler makes no attempt to 
recognize implicit vector operations in loops 
containing scalar computations since the language 
provides access to all the hardware vector 
facilites. 

Key elements of the language are the array 
and vector referencing notations. Variables 
deciared as vectors or arrays cai be indexed in 
the normal way yielding a vector in the array case 
and a scalar in the vector case. It is also 
possible to select a range of elements, known as a 
subvector, from a vector variable or array element 
using notations which specify the index of the 
first element and length, or the indices of the 
first and last elements. 

The SL/1 compiler is organised into three 
phases. The first phase translates the given SL/1 
module into a series of quadruples (quads). The 
second phase optimizes the quads, and the third 
phase translates these optimized quads into a 
relocatable object module. In the rest of this 
paper, the term quad is used to mean an operator 
of the intermediate form and all (possibly zero) 
of its associated operands. 

There are two important characteristics of 
the quadruple intermediate form. First the 
sequence contains quads which represent the 
control structure of the program in terms of the 
control statements of the language. This enables 
the optimizer to detect explicit program loops and 
control flow very easily. Secondly, some 
high-level operations such as indexing and forming 
subvectors translate into sequences of low-level 
quads which represent single instructions. This 
enables the optimizer to detect redundant 


computations in these high-level operations. 


For common-subexpression analysis and _ code 
motion, the design of the optimizer is similar to 
the quad improver described by Hecht [5]. — 

SL/1 allows arbitrarily complex vector 
expressions. This may result in the creation of 
temporary vectors, and these vector temporaries 
may be of different lengths. Building a temporary 
necessitates the execution of several scalar 
instructions to allocate space in virtual memory 
and increases the program’s working set size by 
the size of the vector temporary. The compiler 
attempts to minimize the number of vector 
temporaries required to evaluate an expression in 
order to reduce this. overhead. One technique 
employed is to use a single vector temporary in 
place of a number of equal length vector 
temporaries whose life spans are disjoint. This 
technique is a generalization of the algorithm 
described by Dantzig and Reynolds [6] which has 
been shown to minimize the necessary number of 
temporaries. A second technique is used only when 
the expression constitutes the right hand side in 
a vector assignment. In this case the compiler 
attempts to use the left hand side variable in 
place of one of the vector temporaries. 

Five SL/1l programs which were considered 
typical were measured by an instrumented version 
of the SL/1l compiler. Table 1 shows the total 
number of quads and words of machine code with and 
without all optimizations, and the length of each 
program in lines. On average 27% of the quads, 
and 28% of the machine code were removed. 


TABLE 1 - Overall Quad and Code Reductions 


Program Number 


1 2 3 4 5 

Source Lines 255 691 986 603 1647 
Quads Without Opt. 641 2894 1910 1665 3741 
Quads With Opt. 491 1710 1688 1196 2516 
Reduction 23.4% 40.9% 11.6% 28.2% 32.7% 
Code Without Opt. 554 3567 2638 1835 3680 
Code With Opt. 462 2168 2229 1079 2656 


Reduction 16.64 39.24 15.54 41.24 27.82 

Several quad operations had a relatively high 
probability of being redundant. Sixt y-four 
percent of the scalar addition quads and 5/74 of 
the subvector quads were removed by common 
subexpression elimination. An optimizer which 
considered only these two quad operations would 
detect. 864 of the total number of redundant quad 
operations for the five sample programs. 

Table 2 shows the static frequency of 
occurrence of certain SL/1 statements. Assignment 
represents at least 74% of the total number of 


executable statements and the average proportion 
is 85%. As well as occurring in large numbers, 
assignment statements occur in groups and large 


basic blocks tended to dominate. Table 3 shows 
the largest basic block observed for each program 
and the proportion of each program which was’ made 
up of basic blocks which were ten or more lines 
long. 


TABLE 2 - Statement Frequencies 


Program Number 


1 2 3 4 5 
Assignment 115 357 357 244 757 
Procedure Call 9 3 34 8 69 
IF Statement 13 1 ll 7 26 
FOR Statement 7 19 12 7 5 
WHILE/REPEAT Statements 2 0 6 0 l 
GO TO Statement 2 0 1 0 2 


TABLE 3 - Basic Block Sizes 


Program Number 


1 2 3 4 5 

Largest (In Lines) 19 110 38139 176 8293 
Ten or More Lines 40% 702 71/4 =—88% = 66% 

From Tables 2 and 3 .it can be seen that 
relatively little use is made of control 
structures. A simpler optimizer is possible if 
common subexpression analysis is performed only 
across basic blocks. The SL/1 optimizer was 
modified to operate in this way and the five 
sample programs were recompiled. Table 4 shows 


the total number of quads and words of machine 
code with this less powerful optimization and with 
no optimization. | 


TABLE 4 — Quad and Code Reductions 


Program Number 


1 2 3 4 5 

Quads Without Opt. 641 2894 1910 1665 3741 
Quads With Opt. 537. 1710 1767 1200 2617 
Reduction 16.2% 40.9% 7.54 27.9% 30.04 
Code Without Opt. 554 3567 2638 1835 3680 
Code With Opt. 491 2168 2320 1082 2721 
Reduction 11.4% 39.2% 12.1% 41.0% 26.12% 


The effectiveness of eliminating unnecessary 
vector temporaries was measured and the results 
are shown in Table 5. The average reduction in 
code volume is 10.3%. These measurements were 
made without common subexpression elimination. By 
comparing Table 5 with Table 1 it can be seen that 
in terms of code volume reduction, eliminating 
unnecessary vector temporaries made a large 
contribution to the total optimizer’s performance 
on three of the sample programs. | 


TABLE 5 - Vector Temporary Elimination 
Program Number 


Code Without 554 3567 2638 1835 3680 


Code With 549 3061 2384 1372 3623 
Reduction 0.9% 14.2% 9.6% 25.2% 1.5% 


Table 6 shows the measurements of code motion 
on the sample programs. No candidate quad was 
found to be invariant inside two or more nested 
loops in any of the programs. The performance of 
code motion is rather poor due 
caution which is exercised in selecting operations 
to move, and partly to the relatively small 
numbers of explicit program loops. | 


TABLE 6 - Code Motion Effect 


Program Number 


l Z 3 4 5 
Total Quads 491 1710 1688 1196 2516 
Quads Considered 60 74 230 50 31 
Quads Moved 10 7 194 2 20 


For the majority of users, the most important 


benefit from optimization is the reduction in 
program execution time which it is expected to 
produce. The five SL/1 programs’ used in this 


study were each executed with no optimization and 
with full optimization using data supplied by the 
programmer and regarded as_ typical. The 
percentage reductions in execution times produced 
by the optimizations were: 


Program Number 


1.04 28.87% 1.79% 


Except for program 2, these reductions are hardly 
of any value. Notice that program 2 also 
experienced the largest quad volume reduction. 
The reason for these poor results is that’ the 
optimizer removes’ scalar operations only and the 
five programs were heavily vectorized; their 
execution times were dominated by very long 
duration vector instructions. The execution of 
10,000 scalar instructions must be prevented in 
order to have an effect comparable in execution 
time with a single vector instruction operating on 
long vectors. The performance of the optimizer in 
the critical area of execution time is thus very 
dependent on the vector lengths used and the 
degree of vectorization of the program. 

In order to assess the effect of different 
vector lengths on optimizer performance, program l 
was executed with vectors ranging in length from 
64 to 16,128. The. percentage reduction in 
execution time for the various vector lengths 
resulting from use of the optimizer are: 


Length Reduction Length Reduction 


64 31.0% 5888 3.0% 
128. 26.0% 7168 2.5% 
256 25.0% 8448 3.6% 
320 20.0% 9278 2.2% 
640 17.0% 11008 1.7% 
768 15.6% 12288 1.8% 

2048 10.2% 13568 0.7% 
3328 5.0% 14848 1.3% 
4608 3.7% 16128 —«-:11.0% 


partly to the 
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For very short’ vectors, the  optimizer’s 
performance is considerably better than with long 
vectors. 

In the SL/1 optimizer, a very small subset of 
the quad operators was responsible for most of the 
code removal, and analysis of common 
subexpressions across control structures and code 
motion both proved relatively ineffective. In 
addition, the ratio of instruction execution times 


means that the effects of optimization are 
extremely program dependent and in terms of 
execution time, optimization was of almost no 
benefit in many cases because of the dominance of 


long vector instructions. This problem is 
significantly worse with the CYBER-203 where the 
instruction execution time ratio is much higher. 
The CRAY-1 [7], on the other hand, has a maximum 
vector length of 64 and the instruction execution 
time ratio is orders of magnitude less than that 
of the STAR-100 and CYBER-203. The optimizations 
attempted by the SL/1 compiler would probably be 
much more effective on programs which are executed 
on the CRAY-l. 

A very simple optimizer is 
appropriate ‘for  vector-oriented languages 
machines like the STAR-100. 


probably most 
on 
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1. Introduction 


The present work is concerned with the high 
level language programming of a large-scale, 
tightly-coupled, speedup-oriented MIMD machine of 
the type proposed in [1] or [2]. | 


We will assume a high level language which 
differs from traditional high level languages only 
in that it contains constructs for: 


* dynamic (run-time) spawning of parallel 
processes 


* run-time communication between processes 


* dynamic (run-time) identification of one 
process by another for the purpose of estab- 
lishing communication. 


What we have in mind for the first of these 
constructs is something on the order of a SPAWN 
statement of the following type: 
SPAWN <name of code>(<parameter 1>;...,<parameter k>) 


What we have in mind for the second of these 
constructs is statements of the following type: 


a) WRITE BUFFER <buffer name> FROM 
<private variable> 


b) READ BUFFER <buffer name> INTO 


<private variable> 


where <buffer name> is the identifier of a shared 
variable. 


We will be concerned with:. 
i) showing that if direct interprocess com- 
munication is limited to communication be- 
tween processes which bear the parent-child 
relationship, with all other communication 
constructed indirectly from parent-child 
communications, then the speedup promised 
by parallelism will, in general, be 
vitiated 
ii) proposing a construct for the dynamic (run- 
time) identification of an arbitrary process 
by another process for the purpose of 
establishing communication. 

2. Parent-Child Communication 
We define the "spawning tree" of a system of 
cooperating sequential processes to be that graph: 


* whose. nodes represent the processes 
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* whose directed arcs represent the spawning re- 
lation. I.e., a directed arc from node A to 
node B represents the fact that A spawned B. 


Different systems of cooperating processes 
will, of course, have their own idiosyncratic 
communication patterns. In order to get a handle 
on the general case, we will assume a random pat- 
tern of necessary communication. I.e., we will 
assume that wherever a process might "sit" on the 
spawning tree: 


* each time a process needs to communicate with 
another process it will choose the latter 
process at random from a uniform distribution 
over all the processes (including itself for 
the sake of simplicity) in the system of 
cooperating processes 


- different communications from the same (source) 
process will be directed at destination proc- 
esses chosen independently of one another 


- different (source) processes will choose the 
destinations of their communications inde- 
pendently of one another. 


These assumptions, are, in one important 
sense, very optimistic. That is, in the long run. 
they ensure uniformity of spread of the total vol- 
ume of communication traffic over the set of all 
pairs of processes rather than possibly skewing 
that same total volume of traffic. What we will 
see, however, is that the spawning tree, because 
of its structure, will still form a very ineffi- 
cient base for carrying communications; i.e., we 
will see that uniformity of traffic over the set 
of all pairs of processes when superimposed on the 
hierarchical structure of the spawning tree creates 
intolerable speedup-vitiating bottle-necking. 


To start, then, let us take as our unit of 
time the time within which a process - on the aver- 
age - sends a communication to some (randomly 
chosen) process. To simplify matters, and without 
loss of generality, we will assume that each proc- 
ess sends exactly one communication to some process 
during each unit of time. 


We will assume, then, that once per unit of 
time each node (process) of a full binary (spawn- 
ing) tree containing N = 2™-1 nodes will generate 
one communication addressed to some node chosen at 
random from among all N nodes. For each arc, oa, of 
the tree we will be interested in the amount of 
traffic, Ty, - i-e., the number of communications - 
generated during one unit of time and destined to 
traverse the arc a at some noint in its journey 
from its source to its destination. More precisely we 
will be interested in E(T,y), the expectationof T,. 


Now a full binary tree with N = 2™-1 nodes 
is, of course, of depth d=m-1. For some & then, 
1<X&<m-1, let the arc a be 2 levels up from the 
leaf nodes of the tree as in Figure 1. If we let: 


* a be the number of nodes in the subtree tj, 
of Figure 1 (including the root node of tj) 


* b be the number of nodes in that part of the 
tree of Figure 1 (clearly not a subtree) 
labeled ty (i.e., all the nodes of the entire 
tree except those in tj) 


: TuP be the number of communications (gener- 
ated during one unit of time) destined to 
traverse a in the upward direction 


. 7cown be the number of communications (gener- 
ated during one unit of time) destined to 
traverse a in the downward direction 


then: 
up} — ab = down 
e{meP) = 2 = 2 [adown 
But a = 2" —~ 1, and b = 2-1-a=2M_2*, We 
thus have: 
| p | rUP | _ (2%-1) (27-2") 
i omy: 
gd=m- 1 
levels of 
&£-1 arcs 
levels 
of 
arcs 
(FULL BINARY SPAWNING TREE OF DEPTH d) 
Figure 1 
or 
E(T,) _ 2m_ 1 


If we now let a be either of the ares for 
which 2 =m- 1, i-e., either of the arcs directly 
emanating from the root of the full binary tree we 
see that 


a0 yon) 
gry 


es ( om 2) gm 1 
gi 1 
aaa 


- What this means is that, subject to our opti- 
mistic statistical assumptions, each period during 


R(T) 


(N-1) 


which every process generates one communication 
causes the process at the root of the tree to per- 
form O(N) units of work sequentially. (The root 


- process is, after all, as are all the processes, a 
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sequential process, and it is expected to have to 
handle (N-1) (2m-1/om-1-1) communications.) What 
this means, among other things, is that an N- 
process system of parallel processes is not ex- 
pected to terminate in less than O(N) time. 


3. Process Identification 
Given, then, that in general implementing an 

arbitrary inter-process communication as a sequence 

of parent-child communications leads to intolerable 

loss of speedup, it is necessary for communicating 

processes to be able to directly identify one an- 
other for the purpose of establishing direct com- 
munication. | 


In the simplest case, i.e., that of two spe- 
cific processes which are known at compile-time 
(actually, at the time the program is written) to 
have to communicate with one another, there is no 
problem. For example, suppose that procedures 
PROCA and PROCB are each to be activated exactly 
once, and that the one activation of PROCA is to 
communicate to the one activation of PROCB a re- 
sult which the former will compute and store inits 
private variable RESULTA; the programmer need 
simply invent a buffer name, say BUFFAB, and aname 
for a private variable, say RESULTFROMA, and then 
write the code for PROCA and PROCB as in Figure 2. 


PROCEDURE PROCA; 
SHARED BUFFAB; _ 


e 


RESULTA: = ...s $5 
WRITE BUFFER BUFFAB FROM RESULTA; 


END; 


PROCEDURE PROCB; 
SHARED BUFFAB; 


READ BUFFER BUFFAB INTO RESULTFROMA; 


e 


END; | 
(EXAMPLE OF COMMUNICATION CODE WHEN COMMUNICATION | 
PATTERN IS KNOWN EXPLICITLY AT TIME OF PROGRAM- 

WRITING) | 
Figure 2 


Suppose, though, that the situation is more 
complicated, i.e., suppose that for the application 
of interest, processes must dynamically - i.e., on 
the basis of results which they will compute rather 
than on the basis of criteria explicitly known at 
compile time - "develop the need" to communicate 


with one another. How, in this case, are proce-. 
dures to be coded in such way that processes which 
"develop the need" may establish a means of com- 
munication with one another? 


For the purpose of enabling such communica- 
tion, we propose constructs for the dynamic crea- 
tion of variable names. To wit, the notion of a 
schematic variable name is defined as follows: 


<schematic variable name>:: = (<schema>) ; 
<schema>:: = <character> | 
<arithmetic expression> | 


<schema><schema>; 


Note that in the above definition of <schema>, 
<arithmetic expression> denotes an arithmetic ex- 
pression each of whose characters is underlined. 


The semantics of schematic variable names is 
as follows: 


* an underlined arithmetic expression is to be 
evaluated, and the numeric value translated 
into the character string representing that 
value 


- a character not underlined represents itself. 
Thus, for example, if 
WRITE BUFFER (JOE/1**2/J+8)FROM <private-variable> 


is executed at a time at which I has the value 5 
and J has the value 3, then the statement which 
will effectively be executed will be 


WRITE BUFFER (JOE/25/11)FROM <private-variable> 


The manner in which processes which dynam- 
ically develop the need to communicate establish a 
means of communication is clear. Before the intro- 
duction of dynamically-created names two processes 
communicates with one another if and only if one 
executes a statement of the form READ BUFFER 
<buffer-name 1> INTO <private-variable 1>, the 
other executes a statement of the form WRITE BUFFER 
<buffer-name 2> FROM <private-variable 2> and 
<buffer-name 1> happens to be identical to 
<buffer-name 2>. This is still of course true, 
but now the name of the buffer may itself be com- 
puted at run time. 


The specific details of the proposed construct 
for the dynamic creation of buffer names, however, 
is not the important point. Rather, what is of 
consequence is that once large-scale, tightly- 
coupled, speedup-oriented MIMD computation becomes 
a widespread reality, algorithms will be developed 
which will require the dynamic establishment of 
communication on the basis of computed results. 
This will be the case, for example, in the solution 
of PDE's over dynamically varying grid structures 
and in such AI applications as natural language 
understanding. In such applications, some means 
for the dynamic creation of buffer names or some 
alternative means for the dynamic identification 


of one process by another will be of critical 
importance. 
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Abstract 


A computer architecture is presented that 


processes in parallel programs written in high lev | 


el languages capable of being expressed in the 
lambda notation (applicative languages) . 


Internally, it is a collection of weakly-cou- 
pled general purpose processors, without a hierar- 
chy among them. Each processor evaluates a part of 
a program, thus permiting asynchronous computation. 


The architecture here exposed has been devel- 
oped for the Lisp language, although other appli- 
cative languages are also possible. The hardware 
implements the function calls, argument passing 
and sequencing of tasks. Each processor is a Z-80 
microprocessor that is programmed to execute the 
Lisp primitive operations. 


The AHR machine operates as a slave of a 
general purpose minicomputer. This avoids doing 
1/0 in the AHR machine. In addition, all interac- 
tions with the user(s) are done by the normal 
operating systems of the mini. 


The machine is being built at the Comput ing 
Systems Dept. (IIMAS). 


|. Introduction and Project Status 


This paper presents the architecture of a 
parallel general purpose computer that has Lisp as 
its main programming language. It is built of 
several dozens of microprocessors (Z-80's), each 
of them executing a part of the program. 


Goals 


The goals of the Project AHR (Arquitecturas 

Heterarquicas Reconfigurables) are: 

* To explore new ways to perform parallel proces- 
sing. 

* To have a machine in which it will be possible 
to develop parallel processing languages and 
software 

* To have a tool for students to learn and practice 
parallel concepts in hardware and software. 


Project Status 


Version 0 [3] of the machine has been de- 
signed and simulated. This produced Version 1[12] 
which was simulated using SIMULA. Results of the 
simulation are not to be found here, but in[ 8, 9, 
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pected to be operational [5] in 1981. 


12] instead. 


We are building Version 1 of the machine, ex- 
Subsequently, 
a faster version will be built, possibly incorpo-. 
rating changes and ideas sprung from our experience 
with the first machine. Finally, this fast version 
will be used to try to attain the goals mentioned 
above. 


About six people full time are involved in 
the project. 


The expected uses of the machine also include 
picture processing, finite element methods, engi- 
neering calculations, and distributed processing. 


Matn Features 


The AHR machine has the following character- 
istics: 


* general purpose. 

* parallel processor. 

* heterarchical. It means that there is no 
hierarchy among the processors; there is no 

or controller. All the 
processors are at the same level. 

* asynchronous operation. | 

‘ it has Lisp as its main programming language 

* processors do not communicate directly 
among themselves. They only ''leave work"! 
for somebody else to do it. 

* no input/output. This is handled by a mini- 
computer to which tne AHR machine is 
attached. 

* no operating system (cahere): Most of the 
Lisp operations, as well. as the garbage 
collector, are written in Z- 80 machine 
language 

* the AHR machine works as a slave of a gen- 
eral purpose computer (a mini or micro). 

* gradually expandible. More microprocessors 
can be added as additional compuryng power 
is needed. [9 1 


Imacter!! processor, 


Functional Notation 


The AHR machine obtains its super yeiten by 
parallel evaluation of the arguments of functions. 
For instance, in f(a,b, g(u,g(x,b))), first x and 
b are evaluated; then g of them, in parallel with 
u; then g of the result, in parallel with a and b. 
That is, evaluation occurs from bottom up, or from 
the inside to the outside of the expression. This 
is In accordance with the rule for evaluation of a 


function: ''to evaluate a function, the arguments 
have to be already evaluated". 


Recursion is handled [3] by substituting the 
function name (''FACTORIAL'') by its function defi- 
nition (LAMBDA (N) (IF (EQ N 0) 1...)) when eval- 
uating it. 


The machine works with pure Lisp, without 
SETQ's, GOTO's, Label's, RPLACA. 


11. The Parts of the AHR Machine 


In this section the constituents of the ma- 
chine are described; section I!!! explains how the 
machine works. Refer to figure 2. 


Passive Memory 


This memory holds lists and atoms; it holds 
partial results and parts of programs that are 
not being executed at the moment. 


Originally, the programs to be executed re- 
Side here, and they are copied to the grill for 
their execution. As new data structures are built 
as partial results of the evaluation, they come to 
the passive memory to reside. 


The Grill 


This memory holds the programs that are being 
executed. A program, once in the grill, is being 
transformed into results, as the result of its 
evaluation. 


Programs reside in the grill in the form of 
nodes, as figure 1 illustrates. Each node is 
pointed at by its sons (its arguments), and its 
nane field contains the number of nonevaluated 
arguments. Nodes with nane = 0 are ready for eval- 
uation. 


The Lisp Processors 


These active units are microprocessors (about 
several dozens of Z-80's) that obtain from the 
grill nodes ready for evaluation, and, after eval- 


uation, return results (s-expressions) to the grill. 


Each Lisp processor knows how to execute every 
Lisp primitive. Each of them works asynchronously, 
without communicating with other processors. 


The processors obtain new work to be done 
from the distributor, through the high speed bus. 
This work comes as a node ready to be evaluated. 


Only nodes with nane = 0 come up to the 
processors for evaluation. So, for instance, 
(CAR '(A B C)') will evaluate to A. The node 
(CAR '(A B C)') has become the result A. The Lisp 
processor has to do, after evaluation, the follow- 
ing things: 


Lisp 


1.- Insert the new result A in the cell (in 
the grill) pointed to by the node (CAR 
'(A B C)'). That is, insert such result in 
a slot of the father of the evaluated 


node (see such slots in figure 1). 
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Release the grill space occupied by node (CAR 


2.7 
'(A B C)'). | 
3.- Substract 1 to the nane of the father. 
4.- If the new nane (of the father) is zero, in- 


scribe the father in the fifo: the father is 


now ready for evaluation. 


(LIST (CONS (CAR A) 
(CDR B) ) 


Figure 1 


NODES IN THE GRILL 


Above, the Lisp expression to be 
evaluated. Below, how it is stAuc- 
tured tnto nodes, each node being 

a function on a vartabke. Each rode 
Shows a number: its nane, or number 
of non-evakuated arguments. When a 
node has a nane of zero, 4t means 
that such node 45 neady for eval- 
uation. 

Empty wonds ane slots where the 
nesukts of evaluation wih be inser- 
ted. Fon tnstance, the results of 
(COR B) will be inserted in the slot 
marked "*", 


These steps are initiated by the processor 
simply by signaling to the distributor that the 
processor has finished, and that its results 
should be handled in mode "normal end'' (burocracia 
de salida, in Spanish [12] ); the distributor 
itself performs the requested steps. 


Notice that in this form nobody has to search 
the grill looking for nodes with nane=0, because 
as soon as they appear, they are inserted into the 
tail of the fifo. 


The Lisp processors have access to the pas- 
sive memory (where lists and atoms reside), and to 
the variable memory, where we have the values of 
variables. 


A Lisp processor is either busy (evaluating 
‘a node) for it is ready to accept more work (an- 
other node). 


The high speed bus 


Connecting each Lisp processor with the dis- 
tributor is a high speed bus that goes into the 
private memory of each processor. The new node 
that the distributor throws is inserted (through 
the high speed bus) into the memory of the selec- 
ted processor. Then, the processor is signaled to 
proceed. 


The slow speed. bus 


This bus runs from the i/o processor (the 
mini or micro to which the AHR machine is connect- 
ed) to each bos. It is not shown in the diagrams, 
nor it is explained furthermore in this article 
(See [5] ). Through this bus each processor is 
loaded with programs, prior to starting the ma- 
chine. Also, in the debugging stage, the slow bus 
is used to pass statistical information to the i/o 
processor. The slow speed bus is not used during 
normal execution of Lisp programs. 


Variable Memory 


TA) we Bae Oe ee ee ee Ro cs OL Be Beso N 

mis Memory Contains pairs Or \Variavie,value), 
and it is organized as a tree, or a collection of 
a-lists, where each pair (variable,value) points 


to older pairs. It is accessed by the Lisp proces- 
sors, and it is augmented (a branch of the tree 
grows) after each LAMBDA binding. 


Since the evaluations are.made in parallel, 
the a-lists could grow in parallel, too. For in- 
Stance, consider the following expression 


BODYO: (list ((lambda(X) BODY1) 3) ((lambda(x) 
BODY2) 4)). 

Then, if when evaluating BODYO the a-list jis 
ALISTO: ( (xX,A) (Y,B) (z,9) ) 

Then, when evaluating BODY1, the a-list is 
ALIST1: ((X,3) (X,A) (Y¥,B) (Z,9)); 

and when evaluating BODY2, the a-list is 
ALIST2: ((x,4)  (X,A) (Y,B) (Z,9)). 


But since the evaluation of BODY1 and BODY2 
can be carried in parallel (by two different Lisp 
processors), this means that ALIST1 and ALIST2 
coexist at the same time in variable memory, but 
~BODY1 points to ALIST1 and BODY2 points to ALIST2. 
So, each processor has its “appropriate! a-list to 
work with. 
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Both ALIST1 and ALIST2 share ((X,A) (Y,B) 
(Z,9)) between them. That is, they "share'! ALISTO. 
ALISTO grew in two directions, like a tree, giving 
rise to ALIST1 and ALIST2 simultaneously. This 
explains the affirmation that ''the variable mem- 
ory contains a tree of a-lists''. 


The Distributor 


This. piece of hardware communicates the grill 
with the Lisp processors. The distributor keeps 
in the fifo (a memory) an array of nodes ready to 
be evaluated; these nodes are thrown, one in each 
cycle of the distributor, to the Lisp processors 
that are ready to accept new work. An arbiter 
decides which Lisp processor obtains the node; an 
exchange is done (through the high speed bus) 
between that Lisp processor and the distributor, 
the processor accepting the node and releasing the 
result of the previous evaluation, The distributor 
Stores the result in the grill, in the address 
indicated within the result. Generally, this re- 
sult Is stored in a slot of the node which is fa- 
ther of the node just evaluated. 


An overall view of the machine is shown in 
figure 2. 
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FIGURE 2 
THE AHR MACHINE 


Lisp processon 2 is neady to accept more 
work. The distributor fetches a node (to 
be evakuated) from the fifo and sends it 
to processor 2, while accepting the ne- 
sults of the previous evaluation performed 
by such PAHOCLASON. That nesult is stoned 
An the griklh, in a place indicated in the 


destination address of the result. 

Such exchange of new work-- 
previous result 4s perforuned at 
each cycle of the distributor. 

Version 2 of the AHR machine 
will gain speed over Version 1, 
mainky by building a faster distri- 
butor. 

The Lisp processors also have 
access (connections not shown) to 
the variable and passive memories. 


The Fifo. 


The fifo is a first in-first out memory that holds 
pointers to nodes (in the the grill) ready to be 
evaluated. The distributor fetches such nodes 
through the head of the fifo, while new nodes to 
be evaluated are inserted through its tail [5 ]. 


The arbiter. 


If several Lisp processors become ready to accept 
more work, the arbiter (a hardware) selects one of 
them, which will receive the node thrown by the 
distributor. 


If every processor is busy, the cycle of the 
distributor is wasted, since no processor accepts 
the node that the distributor is offering. 


The 1/0 Processor 


It has been said that the AHR machine can be 
seen as a peripheral of a general purpose mini- 
computer. But this mini can also be considered as 
a peripheral of the AHR machine; we thus talk of 
such mini as the 1/0 processor. 


Input/output will be described in next sec- 
tion. 


l}t. How The Machine Works 


Input 


The user uses a terminal of the mini or mi- 
cro (i/o processor) which is master of the AHR m- 
chine. He uses a common editor, disks and the 
normal operating system of the mini. When he js 
ready to run a program, he loads it from disk into 
a part of the address space of the mini which is 
really the passive memory of the AHR machine (see 
figure 3 . In this way, the program is loaded (al- 
ready as list cells) in the passive memory. A sig- 
nal from the i/o processor to the AHR machine 
signifies that Lisp execution should begin. Togeth- 
er with this signal an address is passed, indicat- 
ing where in passive memory resides the program to 
be evaluated. 
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FIGURE 3 
"THE AHR MACHINE AS A SLAVE!! 


The AHR computer is shoun a another 
perimenral of a general purpose 
mintcomputer. The address space of 
the mink comprises the passrtve memory 
of AHR, through a movable window of 4k 
AddiQbS eb . 


Starting 


lt is assumed that each Lisp processor already 
has its programs loaded in its private memory. 


When the AHR machine receives the ''start'! 
signal, the distributor throws a node (called the 
RUN node) to some Lisp processor. This node points 
to the program which will start. 


The program (in passive memory) is copied (i. 
e., transformed from its passive-memory representa- 
tion, which is in list notation, to its grill-rep- 
resentation, which is composed of nodes) by more 
and more Lisp processors (the more leaves or 
branches a program has, the more processors help to 
copy it. Each processor copies a branch of the 
program )into the grill. Nodes with nane=0 are 
inserted by the Lisp processors into the fifo, so 
that some other Lisp processors will execute them, 


Finally, the program has been copied into the grill. 


Notice that at the same time of copying, some nodes 
with nane=0 could have been evaluated by some 
other Lisp processors. 


Evaluation 


When a Lisp processor is idle, it signals to 
the distributor, meaning that it is ready to ac- 
cept more work. 


The distributor chooses (with the help of an 
arbiter) one of several idle processors, and 
through the high speed bus it injects a new node 
[taken from the grill through the head of the fifo] 
into its private memory. It then signals such 
processor to start. 


The Lisp processor. ''discovers'' the node in 
its own memory, with all the arguments already 
evaluated. The Lisp processor proceeds to perform 
the evaluation that the node demands. Suppose it 
is LIST, and its arguments are (A B), M and N. It 
then has to address the passive memroy in the mode 
"give a new cell''. Such cell is given by a cell 
dispatcher (hardware attached to passive memory). 
Three new cells have to be requested. Then the 
Lisp processor forms the result: ((A B) MN). For 
this, it has to store pointers to (A B), to M and 
to N, into passive memory, jin the new cells al- 
ready obtained. Then, it stores the result (which 
is a pointer to passive memory) into a special 
place ("'results place'') of its private memory. It 
has finished. It signals to the distributor that 
it is ready to accept more work. The distributor 
will insert new work (another node with nane=0 ) 
into the private memory of the processor, but it 
will also collect (through the high speed bus; 
see figure 2)from the ''results place!’ in private 
memory, the result ((A B) MN). The distributor 
will store this result into a slot in a node in 

rill, ; The address inthe grill of this slot 
was noua to the (LIST (A B) MN) node, because 
each node points to fits father. Thus, the distri-. 
butorhas no problem in finding where to store the 
result: such address is found also in the '"'results 
place'', together with the result ((A B) MN). 


The distributor-has to do one more thing: it 
has to substract one from the nane of the father 
(which has just received the result ((A B) MN ). 
And if such nane becomes zero, then a pointer to 
the father is inserted by the distributor into the 
fifo through its tail. 


| One last thing: the distributor has to free 
the cell of the node (LIST (A B) MN), so that 
this grill space could be reused [10] , 


The distributor is very fast compared with 
the speed of the Lisp processor. This will be even 
more true if we code "Comp! icated!! Lisp functions: 


(such as MEMBER OF FACTORIAL) in Z-80 machine lan- - 


guage, instead of ''simple" Lisp functions, such as 
_ CDR. | 

“Due to such difference in speed, the distrib- 
utor can keep many Lisp processors working; if the 
distributor is 100 times faster than the (average) 
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Lisp Processors 


bits carry data. 


Lisp function, it could keep 100 Lisp processors 
functioning. It pays to make a fast distributor. 


Output 


Finally, the whole program has been converted 
into a single result (let us say, a list) deposit- 
ed in passive memory. The AHR machine now signals. 
the mini (or i/o processor), giving it also the 
address in passive memory where the result lays. _ 
The mini now accesses the passive memory as if it 
were part of its own memory (remember, their ad- 
dress spaces overlap), and proceeds to the (serial) 
printing process. 


Execution has finished. 


IV. Hardware Considerations 


The first version of the machine will have 5 
Lisp processors, and the i/o processor is another 
Z-80. Each Lisp processor will have 4K bytes of 
private memory, where a pure-Lisp interpreter wil] 
reside [8 ]. 


The maximum number of Lisp processors is 64. 
It could be increased further, but a new arbiter 
needs to be designed in that case. 


The high speed bus 


The distributor inserts a node (7 words of 32 
bits) into the private address space of the select- 
ed Lisp processor, through the high speed bus. It 
does this in 0.5 microseconds. The high speed bus 
runs from the distributor to all Lisp processors. 
lt carries nodes and results. 


The low speed bus 


A 16 bits low speed bus; 8 of them tndicate 
which Lisp processor is addressed, the other 8 
It runs from the i/o processor to 
the Lisp processors. 


An additional use of the low speed bus js to 
broadcast to the Lisp processors the number of a 
program that needs to be stopped or aborted. 


Passive Memory 


20 words of 22 bits; it 
list space, output ports 


It consists of up to 2 
contains the input ports, 
and atom space, 


Version 1 will have only 64K words. 


Access time is 150 nanoseconds. 
bit, 


It has a parity 


The Grill 


words of 32 bits. It 
each with 7 words. 


It consists of up to 219 
is divided logically in nodes, 


Version 1 will have 8K words. Access time is 
55 nanoseconds. The grill contains the nodes that 


are about to be evaluated. 
Variable Memory 


It consists of up to 2!3 words of 32 bits. 
This memory contains names of variables and their 
values at a given time. The variable memory contains 
also real numbers, in its lower half. In its upper 
half it has "'environments'', which are lists of 
cells of 5 words each. 


Version 1 will have 16K words. Access time is 
150 nanoseconds. 


The Distributor 


_ The distributor passes nodes from the grill 
to the Lisp processors, and stores in the gril] 
the results coming from the Lisp processors. There 
are two versions of the distributor. 


First versionof the distributor: 


This first version [10] is implemented through a 
Z-80, using a program that performs all the func- 
tions of the distributor, It runs slowly, in the 

sense that distributes nodes at low speed, It js 

further described in Section V-Software conside- 

rations. 


Second version: fast distributor: 


Not yet built, it will become part of version 1 of 
the machine. It will be built either from bit-slice 
microprocessors, of from PAL's. 


The fifo 


Of a maximum size of 2!9 words of 19 bits, it con- 
tains pointers to the nodes in the grill. Version 
1 will be of 4K words. Its access time is 55 nano- 
seconds, . 


The arbiter 


There are really three arbiters, for passive memory, 
variable memory and for the grill. 

Each arbiter takes 400 nanoseconds to respond, 
and it may handle up to 64 processors. Each proces- 
sor has a fixed priority, varying from 1 to 64. 
Each processor has a different (unique) priority. 
The assignment of priorities to processor really 
does not matter, since all of them are equal (they 
are able to perform exactly the same tasks). Of 
course, if there are too may processors, those with 
lowest priorities will never obtain work (nodes) to 
do. 


The 1/0 Processor 


It is actually built around a Z-80 that works 
as a general purpose computer. Its main functions 
are: 


* to 
to 
* to 


talk to the users; to read their input and 

print their results. 

store user files in its disk. 

©‘ to initialize the AHR machine. 

* to load into passive memroy, through the 
window, the programs loaded from disk. 
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* To begin garbage collection. 

* To end garbage collection. 

* Actually, the garbage collector runs jn the 
i/o processor. 


V. Software Considerations 


The Lisp Interpreter 


A Lisp interpreter runs in each Lisp proces- 
sor. It interprets pure Lisp (only evaluations; 
no setq's, rplacd's or other operators). The 
garbage collection is not done, at this moment, 
by the Lisp processors. 

For the first version, the Lisp interpreter 
will do argument checking of the Lisp functions. 
This will reamin as an option in the second ver- 
sion of the AHR machine. 


The Garbage Collector 


For the first version of the machine, it will 
be a ''normal'' serial garbage collector, running 
in the i/o processor. While it works, the Lisp 
processors remain idle. For the second version, it 
will be a parallel incremental garbage collector, 
running in the Lisp processors. 

Garbage collection fs done for passive memory 
(list cells) and for the real numbers region of 
variable memory (where it compactifies memory). 

In the ''environments'' zone of variable memory and 
in the grill (nodes), there is no need to recol- 
lect garbage, because used space, as soon as it 
is abandoned in these two places, it is inserted 
(by hardware) into a list of free environment 
cells (for variable memory) or into a list of 
free nodes (for the grill). 


The Distributor (First Version) 


This is a piece of software [10] running in 
a Z-80, that emulates all the functions that the 
"'real'' (hardware) distributor performs. It js 
slow in this sense, but it is flexible and helps 
in the debugging of the AHR machine; it may be 
run ''step by step'’ to see the flow of information. 
It also keeps statistics of use of hardware and 
software. 


Editing 


Editing of Lisp programs is done outside the 
AHR machine, using the operating system and editor 
of the i/o processor. After editing, the program 
is filed on disk. From here, a loader (running 
in the i/o processor) converts it into list cells 
and brings the program to passive memory. See 
Figure 3. 


Performance of the Machine 


No figures can be given at this time, since 
the AHR machine is not yet completed. 


New Advances as of June 1981. 


The hardware is now working; the software is 


about to be completed. 
VI. Related Work and Machines © 
Greenblatt's Lisp Machine 


This is a single processor machine [14] built 
for high speed Lisp computations. It does not pre 
tend to be an experiment in parallel hardware; it 
gains its speed and power from careful design of 
the software and machine architecture, as well as 
from the experience of the builders with the Lisp 
language. | 


Parallel Lisp Machine 


The machine [7 ] is a loosely coupled multi- 
processor for applicative languages such as Lisp. 
It is the machine most closely resembling ours, in 
its application. 


Data Flow Machines 


These machines [13] resemble the AHR archi- 
tecture in that data is directed through ''boxes" 
that process them. The flow of executions is 
controlled, like in our design, by what previous 
results are ready (available). The cited article 
describes a machine that uses different colors of 
tokens to mark ''this result'', ''previous result", 
and so on. 


Zmob 


A collection of Z-80's around a conveyor belt, 
this machine [11] may be applied to image pro- 
cessing and numerical calculations. Each micro~ 
processor has its own private memory. They do not 
iave a common memory (as AHR 
of the micros, a huge central 
may reside. 


does), but behind one 
memory or mass memory 


Pat 


This is a machine [2] suitable for jiamge 
processing. It is a dynamically reconfigurable 
multimicroprocessor~based machine. It can be par- 
titioned into several groups of processors which 
may be assigned’ to execute multiple independent 
SIMD processes and MIMD processes. 


The Language ''L'' for Image Processing 


"L" is a language suitable for processing of 
images. It is mentioned here because it may be im- 
plemented in a parallel machine [4] , such as the 
AHR computer. The language is described elsewhere 
[1], It was designed mainly as a 
result of our experience in picture processing of 
multispectral images [6]. "'L'' has not been im- 
plemented. 


VII. Conclusions 


The architecture of the AHR computer shows 
that it is possible to build a multiprocessor of 
the MIMD type, where each processor does not 
explicitly communicate with other processors. In 
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The AHR design, a processor does not know how. 
many other processors are there, or what they 
are doing. It is not possible to address a pro- 
cessor: "here | have a message for processor 
number 4," | | | Oe . 

The construction of new software has been 
kept low by connecting the machine to a generat 
purpose computer, thus being able to use already 
available operating systems for time sharing, 
text editors and loaders. 

Once the machine is built, experimentation 
will begin in the design of parallel languages 
and ways to express ''powerful'' commands in 
heterarchical fashfon. Also, if the amount of 
access to memories for each processor is low, it 
may be possible to place each micro in a remote 
place, thus achieving some class of distributed 
computing. That is, a micro can process local 
work (through Basic, for instance) as well as 
remote (Lisp) work. 

Finally, the AHR machine shows how it is 
possible to design a heterarchical system, where 
none of the processors tells the others what to 
do, in what order to do it, or what resources 
are available to whom. 
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DISTRIBUTED PROCESSING APPROACH FOR THE INTERNATIONAL PUBLIC TELEGRAMS 
MESSAGE SWITCHING SYSTEM 


~ Jin-tuu Wang 

Yen-son Lee _ 

sntemnational Telecommunications Administration 
Taipei, Taiwan, Republic of China 


Abstract -- International Telecommunications 
Adminstration (ITA) in Taipei, Taiwan, Republic 
of China, recently has completed the application 
software development for its International Tele- 
gram Automatic Processing System (ITAPS). This 
system adopts an in-house computer network archi- 
tecture that includes four closely coupled mini- 
computers and more than two dozens of micropro- 
cessors. Two of the minis serve as the front- 
end communication processors and others as the 
host message switching processors. These minis 
are interconnected using the SDLC protocol. The 
microprocessors are connected to the front of the 

-communication processors using the RS-232C proto- 
col to handle Telex signalling for those telegrams 
to be delivered/accepted to/from the Telex net- 
work. The ITAPS is configured to provide full 
redundancy so that the hot-standby processors 
will take over the on-line task should any failure 
occur in the on-line system. One of the special 
characteristics of the ITAPS is to print-out 
Chinese address information automatically on the 
received international telegrams to facilitate 
messenger's delivery. Besides hardware architec- 
ture of the system, this paper also describes the 
functional characteristics of the system, software 
design and the integrating testing result. This 
system is one of the large scale software develop- 
ment projects that are carried on in this country. 


Introduction 


The recent advent of minicomputer technology 
prompted the prevailing applications of using 
minicomputer systems for various types of tran- 
saction processing £1,2,3 1. Message switching 


is one of such applications to automize the handl- — 


ing of message records. Although CCITT has set 
certain recommendations for these type of services, 
such as F.31 message format, various systems very 
often differ from one another due to different 
operational requirements of record carriers. 

The International Telecommunications Administration 
(ITA) has called an international open tender for 
the international telegram message switching 
system in 1974, however, the bid was unsuccessful 
because none of the venders could propose a system 
that could meet the user's operational require- 
ments. Furthermore, a non-standard project always 
requires tremendous man-hours to write the specific 
application software in order to meet these re- 
quirements, and the cost for developing such a 
non-standard software package is always very high. 
After few times of unsuccessful open tender on the 
turn-key basis, ITA decided to develop the nece- 
ssary application software to meet its own oper- 
ational requirements. A system appended with on- 
line handling of Chinese address information and 
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with inter-connection to the Telex network is pro- 
bably not available in the market. Therefore, 


it is worthwhile to develop such a non-standard 


system by yourselves not only to meet your own 
requirements but also to gain some practical 
experiences in the field of software engineering 
technology. 


Hardware Architecture © 
Fig. l shows the hardware configuration of 


ITAPS. Two GA-16/440 minicomputers with 112 KW 
core memory and Memory Management System (MMS) 


serve as the host message switching processors, 


while two GA-16/440 minicomputers with 64 KW core 
memory and Memory Parity and Protection (MPP) 
option serve as the front-end communication pro- 
cessors. These four processors are connected 
with SDLC links in such a way that each host has 
a front-end processor through a link and is the 
standby of the other on-line host. | 

In order to facilitate automatic delivery/ 
acceptance of incoming/outgoing telegrams to/from 
the Telex subscribers, Z-80 microprocessors are 
used to handle Telex signalling information with 
the Telex exchange. Each microprocessor is de- 
signed to handle four trunks of call setup and 
clear down signalling to/from the Telex exchange 
using 2K-byte EPROM and 256-byte RAM. These 
intelligent hardware interface boards were designed 
and manufactured locally to response to the CCITT 
No. 2 signalling protocol. Therefore, they serve 
as the protocol converters between the CCITT No. 2 
and the RS-232C protocols. 

All the peripheral devices are attached to 
the message switching host processors. These 
peripheral devices include 2 head-per-tracks | 
(drums), 4 moving head disks, 8 magnetic tape 
drives, 2 card readers, 2 line printers and 2 CRT 
terminals as console. 16 CRT terminals are atta- 
ched to the on-line host processor for manual 
assistance of the intercepted messages while 10 
CRT terminals are connected in distance through 
modem to facilitate telegram entering directly 
from ITA branch offices. These CRT terminals are 
always connected to the on-line processors through 
Automatic Bus Transfer Unit (ABTU). The function 
of the head-per-track is to serve as the transit 
storage for each telegram entering the system, 
while that of the moving head disk is to serve as 
the short-term journaling of telegrams for later 
retrieval and as the storage space for operational 
files and programs. The magnetic tape drives are 
for the long-term journaling of telegrams, automa- 
tic ticketing of outgoing telegrams, system and 
file backup. The card reader and line printer are 
for system software development. The head-per- 
tracks are all connected to the on-line host 
through ABTU while moving head disks are connected 
to the dual port disk formatters and can be 


accessed by either host processor. The magnetic 
tape drives, card reader and line printer are all 
dedicatedly connected to each host processor. 

All the communication lines are connected to 
the ITAPS communication processors via two types 
of asynchronous communication multiplexors, one 
for the slow speed trunk-lines or teletype ter- 
minals, and the other to the modems for the re- 
mote CRTs. The former is the GA-1595 multiple- 
xors that provides 64 lines PIO capability to 
input/output message character and line status 
one at a time after interrupt request. The latter 
is the GA-1535 multiplexors that provide 16 lines 
DMA capability to input/output message characters, 
line or page depending on the operational mode of 
CRT terminal. These multiplexors generate three 
types of interrupt to the communication processor, 
namely, the input buffer full, the output buffer 
empty and the status change of line so that the 
CPU can serve the respective type of interrupt to 
input, output character or sense the status of 
lines. The 1595 multiplexors are further connec- 
ted to two types of line adaptors, one is the 
current loop line adaptors that provide neutral 
current loop interfaces to trunk-lines, the other 
is the RS-232C line adaptors that provide EIA in- 
terface to trunk-lines for the Telex exchange. 
Portion of the current loop line adaptors are con- 
nected to a neutral/bipolar current converter for 
those lines that are in bipolar characteristics. 
Four serial-type graphic printers are connected 
via RS-232C line adaptor to the 1535 multiplexor 
for printing Chinese address information on the 
received telegrams. A line monitor and patch 
panel is also installed to provide signal monitor- 
ing, line cross-patching, and trunk line interfa- 
cing for all the low speed lines and trunks. 


Major Functions and Special Characteristics 


The major functions of the ITAPS are to per- 
form a store-and-forward message switching which 
automatically processes and routes both interna- 
tional incoming/outgoing telegrams to/from this 
country, stores the processed telegrams for later 
retrieval, and provides traffic relevant reports. 
Remote and local CRT positions are also provided 
to facilitate direct editing of outgoing telegrams 
at branch offices and manual assistance of the 
intercepted telegrams at the telegraph operation 
center (see Fig. 2). Detail functions are descri- 
bed as follows: 


Automatic Classification of Incoming Telegrams 


For the incoming telegrams, the ITAPS auto- 
matically classifies the telegrams into ten deli- 
very classes. Four major classes are: (1) to be 
delivered through the Telex network; (2) to be 
delivered by messengers, (3) to be routed to the 
domestic network, and (4) to be printed on the 
- local teletype terminals. 

By using the cable address in the received 
telegram as keyword, the ITAPS looks up the Telex 
number and the Automatic Answer Back (AAB) from 
the database, if there is any, and gives the num- 
ber to the microprocessor interface for automatic 
dial-out. If the circuit connection is success- 
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ful, the communication front-end processor will 
send "WHO ARE YOU" (Figure D) signal to the 
connected Telex terminal for obtaining an Automa- 
tic Answer Back Code. If a complete match occured 
between the returned AAB and that gotten out from 
the database, the ITAPS sends out the incoming 
telegrams to the Telex subscriber who has regis- 
tered using this cable address. If a Telex nun- 
ber is not found under this cable address, the 
ITAPS looks up the Chinese address information in 
another file. This file consists of over 20,000 
records, each of which contains the Chinese 
address of the telegram recipient in terms of 
Chinese character internal codes. Each one-word 
internal code is then translated into its binary 
graphic pattern. A group of these binary graphic 
patterns, lead by a graphic control code, are then 
sent down to a graphic printer which will print 
the Chinese address information including the 
company's full name and address in front of the 
English (ASCII) telegrams (see Fig. 3). 

For those telegrams routed to the domestic 
network, the city name on the telegrams will be 
verified against the city-name file. It will be 
routed to the respective line based on the infor- 
mation from the file. Local Teletype terminals 
include "Full Address" positions, "Service Tele- 
grams" positions, "Inter-office Communication" 
positions, and the "Intercept" positions for the 
abnormal telegrams that require manual assistance. 


Automatic Editing and Routing of Outgoing Telegrams 


For the outgoing telegrams, the ITAPS accepts 
the telegrams from the following four major sour- 
ces: (1) ITA's branch offices can send telegrams 
either by Teletype keyboard/paper tape reader, or 
by remote CRTs; (2) Telex subscribers can send 
public telegrams using simple format; (3) ITA's 
Telegram Operation Center can send telegrams 
either by Teletypes keyboard/paper tape reader, or 
by local CRTs; (4) domestic network can handover 
its international outgoing telegrams to ITAPS. 

Telex subscribers can dial up "923" reques- 
ting a direct connection to the ITAPS through 
the microprocessor interface. If there is buffer 
available in the Communication Processor, the 
ITAPS, after obtaining subscriber's ID, will send 
"GO AHEAD CABLE". The subscriber then send his/ 
her prepared paper tape or type in telegrams. 
After receiving End of Message (NNNN), the ITAPS 
will again verify the same ID to make sure that 
the same circuit has been connected throughout 
the entire period of telegram transmission. If a 
match occurred in the verification, the sending 
subscriber is then given a receipt number on which 
a later inquiry of the telegram may be made. For 
the convenience of the customers, the telegrams 
sent by the Telex subscribers are in simple format. 
The ITAPS will edit the simple format into the 
CCITT F.31 format by automatically filling in the 
numbering line, pilot line, word count, destina- 
tion indicator, and origin indicator, etc. to 
become an internationally compatible interchange 
format. Based on the destination indicators or 
geographical indicator, the telegram is then routed 
onto the required destination international trunks. 

For those telegrams sent from ITA's branch 


offices, there is no dialing-up procedure needed, 
instead, the prepared paper tape can be sent 
directly from Teletype paper tape reader or key- 
board into the ITAPS, or telegrams can be edited 
on the CRT screen and sent to ITAPS by a single 
key action. The remote CRTs are connected to 
the Communication Processor through modems using 
asynchronous RS-232C protocol at 1200 bauds. 
Telegrams may also be input to the ITAPS using 
local CRTs which are directly connected to the 
Message Switching Processor using asynchronous 
protocol at 9600 bauds. 


Journaling and Retrieval of Telegrams 


~Felegrams input/output into/from ITAPS are 
properly recorded or journaled into the short- 
term input/output journal file in the moving head 
disks. Input journal file contains telegrams that 
are originally input into the system with their 
arrival time stamps and system numbers from which 
the respective telegram can be retrieved. Output 
journal file contains telegrams that may have been 
automatically edited into the F.31 format or man- 
nually corrected some erroneous fields in the 
message header, together with their leaving time 
Stamps, system numbers and other information ex- 
tracted from the telegram header. For retrieval 
and report-printing purpose, many inverted files 
are built at the time of output journaling such 
as DELINV (delivery number), ICPINV/OCPINV (Input/ 
Output Circuit Prefix), TIMINV (Time), TIGINV (Te- 
legram ID Group) to facilitate multi-directional 
retrieval from other keys. The on-line retrieval 
commands can be entered from 5-unit Teletypes 
locally and remotely, and from local CRTs. 

Two 80-megbyte disks are installed to allow 
telegrams in two days to be journaled, while mag- 
netic tapes are used to transcribe telegrams for 
long-term filing. The retrieval of telegrams from 
magnetic tape can be made off-line. 


System Switchover and Recovery 


ITAPS is designed to have dual configuration. 
During normal operation one system serves as on- 
line and the other as hot-standby. The on-line 
system does all work including telegram reception, 
assembly, storing, analysis, routing, dispatching 
and disassembly while the standby system does only 
the telegram reception and assembly. The on-line 
message processor continously sends information 
to the standby message processor ordering it to 
release those buffers whose contents have been 
safely written (stored) into the transit storage 
by the on-line Message Processor. Also, in the 
on-line system, a snapshot program periodically 
saves system operational data and tables onto the 
drum (head-per-track) snapshot area including the 
current queue transactions, data and tables in 
common area. If the on-line system fails, or 
either side receives a switchover command, the 
hot-standby system will immediately loads the 
last snapshot area into its core memory, changes 
its own processor state to on-line, and then take- 
over the on-line task without having to load the 
on-line programs from the system disk. During 
normal operation, the same set of real-time pro- 


grams are running or stationary both in the on-line 


and in the standby system respectively, but the 


input processing program is running in on-line 
state or in standby state depending on whether the 
respective processor is in which state. This — 
arrangement allows fast switchover action to be 
taken place. 

After the switchover, the standby processor 
backs up its processing starting from the last 
Snapshot of the system which preserves all the 
necessary information to start over from the last 
mile-stone record. Such arrangement will guarantee 
that no telegram message or character will be lost 
during the switchover transition. 

The system can also be restarted using a 
restart procedure LU(RESTART) to recover all the 
necessary data which have been saved in the snap- 
shot area. 


A Chinese Computer System for File Building 


One of the requirements of ITAPS is to print 
Chinese address information on the received tele- 
grams to facilitate messenger's delivery. This 
requirement motivated the invention of a Chinese 
computer system for file building purpose. This 
system uses an ordinary graphic CRT terminal 
to input and display the selected Chinese characters. 
This is accomplished by assigning key positions for 
Chinese character roots and any Chinese character 
can be defined according to the normal writing 
sequence as a one-dimensional spelling sequence of 
its constituent roots. Such a system can be built 
in a general purpose computer system as part of 
the file handling process. Besides building cable 
address file with Chinese address information, the 
system can be used for general purpose Chinese 
information storage and retrieval purposes. The 
special characteristics of the input method are 
described as follows: 

1. Use ordinary small CRT keyboard without 
special interface, 

2. The number of roots approaches theoretical 
optimum value, which means minimum average key 
strokes, speedy operation and high uniqueness of 
the selected characters. | 

3. The arrangement of roots is on the one 
hand according to the statistical occurent fre- 
quency of roots, which makes the average operation 
speed faster; and on the other hand according to 
the connotational meaning of root to key-position 
alphabets, thus to facilitate beginners’ memorizing. 

4. Spelling sequence can be defined dynami- 
cally by users, thus make the selection operation 
more flexible and multi-directional, for exam- 
ple, normal character being selected from the 
abbreviated writing sequence; multiplication of 
a number and a root to denote the repetition of the 
same root; subtraction of roots can be performed 
for similar roots. 

5. Processing program is very simple, within 
2K words, and the additions of spelling sequences 
and character patterns are independant of the 
processing program. 

It may also be used for English, Japanese, 
Korean and other ideographic languages. | 


Local and Remote CRTs 


Local positions are installed at the Telegraph 
Operation Center to manually assist the system for 
the handling of the intercepted telegrams either 
having format, routing, spelling errors, or un- 
identifiable name or field in the telegram header, 
owing to which the telegrams cannot be properly 
routed or delivered. After human intervention for 
the proper correction, these intercepted telegrams 
will be routed to their proper destination or be 
diverted to a specified printer for further ana- 
lysis. 

The functions of the local CRTs are categori- 
zed by pressing the different function keys. These 
functions keys are built-in to each CRT in the 
right hand neighboring of the normal keyboard area. 
Each of them can trigger a pre-defined process in 
the CRT Processing Module. 

Another block of key area further right hand 
side of the special function keys allows operator 
to edit the telegram in a page mode. By "page" 
mode, it means that the purchased CRTs have a 
local buffer and the limited intelligence to allow 
operator editing telegrams locally or without inter- 
vening the host computer, thus leaving the host 
computer with more CPU time to process other on- 
line tasks. 

The remote CRTs have limited functional capa- 


bilities. The use of function keys are limited to 
send out a newly-edited telegram and to log-in and 
log-off. | 


Operational Commands and Reports 


Various user's designed commands can be en- 
tered at the console CRTs, local CRTs, and the 
remote Teletype positions either to control or 
regulate the system operation, or to obtain the 
current operational status of the system. Commands 
entered from console CRTs are honored by the 
Executive of the Operating System while those 
entered from local CRTs and the 5-unit-code TTYs 
are interpreted and executed by a user's designed 
command interpreter and its associated subroutines 
working in the foreground environment. 

The command entered from the remote TTY posi- 
tions resemble those used in the network access 
operation because they are relayed by the Communi- 
cation Processor to the Message Switching Processor 
for the proper responses. Each command response 
must be returned to the respective TTY that has 
issued the command. The technique used is in fact 
a "packet switching" type transmission of both 
command and the response. These commands can be used 
to obain operational reports or retrieve telegrams. 


Software Design and the Parallel Running Result. 


The design of the application software for 
ITAPS uses the top-down and modular concepts [4]. 
Each module has its pre-defined functions and the 
related mudules have their interfaces. In Message ~ 
Switching Processor, there are Dispatching Module, 
Input Processing Module, Message Analysis and Route 
Selection Module, Output Processing Module, CRT 
Processing Module, Telex Editing Module, Journaling 
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Module, and Command Processing Module. These 
modules are assigned different priority level 
according to the degree of urgency of each module. 
The main interfaces are previded using the "admin- 
nistrative block" appended to the first sector of 
each telegram in the transit storage, and the 
interface tables in high core common area. 
Transactions are passed around in core memory 
using the self-implemented queue manager. There 
are two types of queue, one is the cyclic type 
FIFO queue and the other is the multi-line multi- 
priority queue for output processing module. The 
Same approach is also carried on in the Communica- 
tion Processor in which three receiving modules 
and three transmitting modules are implemented 
except without using the drum transit storage as 
the interface among modules, instead, the packet 
buffers being used as such. The receiving modules 
include Input Interrupt Handling Routing, Input 
Processing Program, and SDLC Output Interrupt 
Handling Routine. The transmitting modules include 
SDLC Input Interrupt Handling Routing, Output 
Processing Program, and the Output Interrupt 
Handling Routines (see Fig. 4). 

The software for the communication lines, 
such as the SDLC, CRT multiplexor, low speed 
multiplexor, and RS-232C multiplexor, is imple- 
mented inside the Input Output System (10S) as 
Handling Routines (Handlers) which is device 
dependant and user-oriented portion of programs 
linking closely to the respective drivers. 

ITAPS has passed various phases of testing 
including the modular testing, integrating testing, 
functional verification testing, and the stability 
testing. Up to now (June 1981), the system has 
been running as the parallel running with the 
manual processing system for more than ten weeks. 
During the testing period, the real traffic as 
well as the simulated traffic are both applied to 
the system. For the modular testing and the inte- 
grating testing, the simulated traffic helped 
prove the correctness of the normal processing 
path of each module, while for the functional 
verification testing, stability testing and paral- 
lel running, the real traffic helped prove the 
correct treatment of the abnormal cases. The 
overall availability of the system within the 
parallel running period is above 99.9%. 

This system is one of the large scale software 
development project in this country. This is one 
way of achieving self-reliance in the brain-inten- 
Sive industry in this country. 
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Abstract -- Distributed processing system reliability has been 
measured in the past in terms of point-to-point terminal reliabil- 
ity, or more recently, in terms of the ‘survivability index’ or 
‘team behavior.’ While the first approach leads to oversimplified 
models, the latter approaches imply excessive computational 
effort. A novel, computationally more attractive measure based 
on multiterminal reliability is proposed. The measure is the pro- 
bability of true value of a Boolean expression whose terms denote 
the existence of connections between subsets of resources. The 
expression is relatively straightforward to derive, and reflects 
fairly accurately the survivability of distributed systems with 
redundant processor, data base and communications resources. 
Moreover, the probability of such Boolean expression to be true 
can be computed using a very efficient algorithm. This paper 
describes the algorithm in some detail, and applies it to the relia- 
bility evaluation of a simple distributed file system. 


1. Introduction 


Distributed processing has become increasingly popular in recent 
years, mainly because of the advancement in computer network 
technology and the falling cost of hardware, particularly of 


microprocessors. Intrinsic advantages of distributed processing 


include high throughput due to parallel operation, modular growth, 
fault resilience and load leveling. 


In a distributed processing system (DPS), computing facilities and 
communications subnetwork are interdependent of each other. 
Therefore, a failure of a particular DPS computer site will have a 
negative effect on the overall DP system. Similarly, failure of the 
communication subsystem will lead to overall performance degrada- 
tion. 


Recently, considerable attempts have been made to systematically 
investigate the survival attributes of distributed processing systems 
which are subject to failures or losses of processing or communica- 
tion components. Two main approaches to DPS. survivability 
evaluation have emerged: 


a) In [MER 80] the term survivability index is used as a perfor- 
mance parameter of a DDP (distributed data processing) system. 
An objective function is defined to provide a measure of survivabil- 
ity in terms of node and link failure probabilities, data file distribu- 
tion, and weighting factors for network nodes and computer pro- 
grams. This objective function allows the comparison of alternative 
data file distributions and network architectures. Criteria can be 
included such as the addition or deletion of communication links, 
allocation of programs to nodes, duplication of data sets, etc. 
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Constraints can be introduced which limit the number and size of 
files and programs that can be stored at a node. The main disad- 
vantage of the survivability index is its computational complexity, 
which makes it practical only to DDP systems with, say, less than 
20 nodes or links. 


b) The second approach is a ‘team’ approach in which the overall 
system performance is related to both the operability and the com- 
munication connectivity of its ‘member’ components [HIL 80]. 
The performance index, defined axiomatically on the connectivity 
state space of the graph, captures the essentials of the ‘team effect’ 
and allows survivability cost/performance trade-offs of alternate 
network architectures. The basic advantage of the team approach is 
that performance degradation beyond the connected/disconnected 
State is measured. One disadvantage of the approach is that of being 
restricted to the homogeneous case and of ignoring other important 
details of real DPS’s. 


In this paper we propose a novel measure of DPS survivability, 
namely multiterminal reliability. We recall that in a communications 
network terminal reliability relative to node pair (/,/) is the proba- 
bility that node / is connected to node j. We extend this notion to 
DPS’s by defining the multiterminal reliability as follows: 


Definition I. The multiterminal reliability of a DPS consisting of a 
set of nodes (processors) V=1,2,...,N is defined as 


eo Prob Ch, D) Ch Jy Dy @B,_} Ch J (1) 
where: 
L155) 4o,49,.54,,5, are subsets of V 


Cc ee denotes the existence of connections between all the 
nodes of the subset / ; and all the nodes of subset J; 


and 


©, has a meaning of OR or AND. 


The choice of the subsets /,,J/),...,/,,/, as well as the interpretation 


of the operator ®, (ij = 1,---,K-—1) depend on the event (task) 
whose survivability is being evaluated. Priority between operators 
is determined by parentheses in the same way as in standard logical 
expressions. 


* This research was supported by the Office of Naval Research 
under contract N00014-79-C-0866. Aksenti Grnarov is currently 
on leave from the University of Skopje, Yugoslavia. 


As an example, let us assume that the successful completion of a 
given task requires node A to communicate with node B or node C; 
and node D and E to communicate with node F andG. The mul- 
titerminal reliability of such task is given by 


Pn = Prob ( Cid; OR Ch Jy) AND Crd; 


where /; = {A}, Jj ={B}, J. = {C}, 13 = {D,E} and J3 = {F,G}. 
The general definition of multiterminal reliability can be specialized 
to characterize the survivability of the following systems: 


(A) Distributed Data Base System: For given link and computer 
center reliabilities, determine the reliability of a specific file alloca- 
tion including redundant copies. 


(B) Teamwork: Given link and processing node reliability, deter- 
mine what distribution of the members will result in highest proba- 
bility of a connection. 


(C) Distributed Data Processing System: Given link and processing 
node reliability and (redundant) distribution of programs and data, 
determine the probability of successfully completing a specific appli- 
cation. 


(D) Computer-Communication Network: Given link and node relia- 
bility, determine the probability of the network becoming parti- 
tioned. 


Note that in all the above applications, system (or application) sur- 
vivability is best characterized by some multiterminal reliability 
measure. In fact, terminal reliabilities alone could not be used to 
compute systems survivability because of the dependencies existing 
between the various events. 


In this paper, an efficient algorithm for multiterminal reliability 
analysis is presented. The algorithm can be applied to oriented and 
non-oriented graph models of DPS’s and can produce numerical 
results as well as symbolic reliability expressions. 


The paper is organized in five sections. In Section 2, the applica- 
tion of Boolean algebra to multiterminal reliability is considered. 
Derivation of the algorithm is presented in Section 3. An example 


for determination of the multiterminal reliability is given in Section | 


4 . Some comments and concluding remarks are presented in the 
final section. 


2. Boolean Algebra Approach 


For reliability analysis a DPS is usually represented by a probabilis- 
tic graph G(V,E) where V=1,2,...,.N and E = aj ay,...,ag are 
respectively the set of nodes (representa the processing nodes) 
and the set of directed.or undirected arcs representing the com- 
munication links. To every DPS component i (processing node or 
link), a stpenastc variable y,; can be associated. The weight 
assigned to the /% A component represents the component reliability 


p; = Pr(y; = 1) 


e., the probability of the existence of the / 
ables are supposed to be statistically independent. 


‘th component. Vari- 


There are two basic approaches for computing terminal reliability 
[FRA 74]. The first approach considers elementary events and the 
terminal reliability of a connection from source s to termination 1, 
by definition, is given by 

Py = » P, 

-Fle=1 

where P, is probability which corresponds to the event e and 
F(e)=1 means that the event is favorable, i.e., it includes a path 
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from s to ¢. 


The second approach considers larger events corresponding to the 
simple paths between terminal nodes. These events however are 
no longer disjoint and the terminal reliability is given by the proba- 
bility of the union of the events corresponding to the existence of 
the paths. 


The complexity of these approaches is caused in the first case by 


the large number of elementary events (of the order 2” where n =~ 


the number of elements which can fail) and in the second case by 
the difficult computation of the sum of the probabilities of nondis- 
joint events (the number of joint probabilities to be computed is of 
the order 2™ where m = the number of paths between node pairs). 


Fratta and Montanari [FRA 74] chose to represent the connection 
between nodes s and ¢ by a Boolean function. This Boolean func- 
tion is defined in such a way that a value of 0 or 1 is associated 
with each event according to whether or not it is favorable (i.e., the 
connection C, , exists). Since the Boolean function corresponding 
to the connection C, , is unique, this means that the connection 
Cy. , can be completely defined by its Boolean function. Represent- 
ing a connection by its Boolean function, the problem of terminal 
reliability can be stated as follows: Given a Boolean function F57, 
find a minimal covering consisting of nonoverlapping implicants. 


Once the desired Boolean form is obtained, the arithmetic expres- 


sion giving the terminal reliability is computed by means oF the fol- 
lowing correspondences 


X; ~~ Dj 


x; 7 gq =1-3; 


Boolean sum — arithmetic sum 
Boolean product — arithmetic product 


A drawback of the algorithms based on the manipulation of impli- 
cants is the iterative application of certain Boolean operations and 
the fact that the Boolean function changes at every step (and may 
be clumsy). The DOOIéan function may bE simplified USiNg one Of 
the following techniques: absorption law, prime implicant form, 
irredundant form or minimal form. Any one of these procedures 
however requires a considerable computational effort. Therefore, it 
can be concluded that these algorithms are applicable only to net- 
works of small size. 


Recently, efficient algorithms based on the application of Boolean 
algebra to terminal reliability computation and symbolic reliability 
analysis were proposed in [GRN 79] and [GRN 80al respectively. 
The algorithms are based on the representation of simple paths by 
‘cubes’ (instead of prime implicants), on the definition of a new 
operation for manipulating the cubes, and on the interpretation of 
resulting cubes in such a way that Boolean and arithmetic reduction 
are combined. 


The proposed algorithm for multiterminal reliability analysis is 
based on the derivation of a Boolean function for multiterminal 
connectivity and the extension of the algorithm presented in [GRN 
80b] to handle both multiterminal reliability computation and sym- 
bolic multiterminal reliability analysis. 


3. Derivation of the Algorithm 


Before presenting the algorithm for multiterminal reliability 
analysis, it is useful to recall the definition of the path identifier 
from [GRN 79]: 


Definition 2. The path identifier /P, for the path a, is defined as a 
string of m binary variables 


IP. = XyXQ.0.Xjo- Xp 


where 


x;=1 if the i” component of the DPS is included. 


in the path 7, 


X; = X otherwise 


and n is the number of DPS components ‘that can fail, i.e: 


= 


= N in the case of perfect links and imperfect nodes 


= 


= F in the case of perfect nodes and imperfect links 


n = N-+E in the case of imperfect links and nodes. 


AS an example, let us consider a four node, five link DPS given in 
Figure 1, in which nodes are perfectly reliable and links are subject 
to failures. The sets of path identifiers for the connections Cs 4 
and C's 7 are given in Table 1 and Table 2 respectively. 


TABLE 1 


S X,A 


S X3B XsA 
S X3B X4IT XA 


TABLE 2 


S xjA xXoT 
S x,A xsB x4T 
S X3B X47 ; 

S x3B xsA x9T 


Figure 1. Example of DPS 


Boolean functions corresponding to Cs 4 and Cs 7 given by their 
Karnaugh maps, are shown in Figure 2. 


Instead of the cumbersome determination of elementary (or com- 
posite) events which correspond to a multiterminal connection, the 
multiterminal reliability can be determined from the Boolean func- 
tion representing the connection. Moreover, the corresponding 
Boolean function can be obtained from path identifiers (Boolean 
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X3x, OO O1 11 10 


X5 = 0 
S,t 


Figure 2. Karnaugh Map Representation of the Connections 
Cs a and Cor 


functions) representing terminal connections. For example, the 
Boolean function corresponding to the multiterminal connection 


Cor = Cs.4 OR Cos 7 


mor 
can be obtained as 
Fo tsa UO Lge 


where U is the logical operation union. Karnaugh map of F,,,,. is 
shown in Figure 3. 


X 1X9 


x3X4 O00 O1 1 10 xX3x4 O00 O1 1 


of] | firyri 00 
01 7 


De Ol 


fi [iki [i 


X5 = | 


Figure 3. Karnaugh map representation of the connection 
Chor = Cs4 OR Cg p 


Covering the Karnaugh map with disjoint cubes, we can obtain F,,,,, 
as 
F nor= Xy is X1X3X5 a X1X3X4X5 


i1.e., multiterminal reliability is given by 
Por = Py + 91 P3P5 + 91 P3P495 
Analogously, the Boolean function corresponding to the multitermi- 


nal connection 
C nid = Cs4 AND Cs r 


can be obtained as 


er oe Fs 4 A For 


where A. is the logical operation intersection. . 


According to the Karnaugh map representation (Figure 4), Foy is 
given by . 


XX Ay) 

x3x4 00 O1 11 10 

00 id 

01 i] 

uf CG 

10 . is _ 

X5 = 0 
ae 
Figure 4. Karnaugh Map Representation of the connection 
Cand = Cs.a AND C, , 


the following set of cubes 
IP = (11xxx,xxl11,1xx11,x11x1,x111x,1xl1x) 


Applying the algorithm REL [GRN 80bl we obtain that the mul- 
titerminal reliability is given by 


Pmand = PP? + P3P4Ps—P1 D9) 


+ P\P4P59293 + 91 P2P394Ps + 91 P2P3P4 


Since the logical operations union and intersection satisfy the com- 
mutative and associative laws, previous results can be generalized 
as follows. 


1) Multiterminal connection of OR type Cy 7 
(T=h,, ty, ee eR a ty) is equal to 
Cyr = Cy, OR C,, OR OR Cy 4, 
and the corresponding Boolean function F, 7 can be 
obtained as 
oe aa ee U Fst U U Fs 
2) Multiterminal connection of AND type C, 7 
(T ={t,t,...,t}) is equal to 


Cc 


S 


= Cy, AND C,,, AND +++ AND Cy, 


and the corresponding Boolean function F, 7 can be 
obtained as 


F. 


Syl 


Fy = Fy A Fog Awe A 


In the case when all nodes from the set S have connections of the 
same type with all nodes from the set T, multiterminal connection 
can be written as Cg 7. 


4. Determination of F's 7 


The determination of Fs 7 by Boolean expression manipuiation or’ 


by determination of elementary events is a cumbersome and time 
consuming task. Hence, these methods are limited to DPS’s that 
are very small in size. 


However, since path identifiers can be interpreted as cubes, the 
Boolean function Fy r can be more efficiently obtained by manipu- 
lating path identifiers. In the sequel we present the OR-Algorithm 
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cl = a,a,°--a;':: a, and c*® = b)b):-- b+ °° 


and the AND-Algorithm for the determination of For of type OR 
and AND respectively. Both algorithms are based on the applica- 
tion of the intersection operation [MIL 65]. Since the path 
identifiers have only symbols x and | as components, the intersec- 
tion operation can be modified as follows: 


Definition 3: The intersection operation between two cubes, say 
++ b,, is defined as 


c’ Ac’ =[(a, A by),(ay A by),...,€a, A 5, ..., Ca, A 5,)] 


where the coordinate A operation is given by 


A 1 x 
l 1 1 
Xx 1 x 


It can be seen that the intersection operation between two cubes 
c’ and c* produce a cube which is common to both c’ and c*. If 
c’ A c’ =c" this means that the cube c’ is completely included in 
the cube c*. The modified intersection operation produces a cube 
which has only symbols x and 1 as coordinates, so the modified 
intersection operation can be applied again and again. Also, the pre- 
vious fact allows us to apply the REL-Aigorithm on the set of cubes 
obtained by the application of the modified intersection operation. 


Let us suppose that the cubes corresponding to. connections 


Cy, 7, and C,, 7, are stored in lists L; and L of length k, and ky 


respectively. Let c/ denote the ja element of the list L,. 
The OR-Algorithm for the computation of Fg 7 follows: 
OR-Algorithm | 

SPER. 


for i from 1 to k, do 
for j from 1 to ky do 


begin 
c=ci Ach ;,ifc = cj then 
begin’ . 
delete cj; from list Ly ; 
end : 
else if c = c4 then delete c4 from list L, ; 
end 
SLEEP. 2: 
Store undeleted elements from the lists L, and L, as new list 
L, | 
END 


As an example, the OR - algorithm is applied to the determination 
of Fy p= Fy, U F,, for the DPS given in Figure 1. The lists 
L, and L» are 


Ly L» 
cf 1Xxxx c4 11xxx 
cf xx1xl c Ixxl] 
cp xlllx cP — xxlix 

cH xxl 


STEP 1: 
STEP 1: c} A c} =c} delete c 
cf A ci =¢3 delete ce 


c} A c3 Hc} Ach: 
c} A eecl Fc; 


ef A cf xc? #c3 


cf A cf =cz delete ct 
cp A cP=c;? delete c? 
STEP 2: 
Ly 
cf 1XXxxx 
cp xxix 
cp xxl lx 


It can be seen that the OR - Algorithm produces a list with minimal 
number of elements which are cubes of the largest possible size. 
The same result could have been obtained from the identification 
of disjoint cubes directly in Fig. 3. Our method allows for the 
efficient generation of all disjoint cubes necessary for reliability 
analysis [GRN 79]. Next, we introduce the AND algorithm. 


AND-Algorithm 


STEP 1. 
for i from 1 to k, do 
begin 
for j from 1 to ky do 
Cha = cf Ach; 
for k from 1 to k,-1 do 
begin 
m = k+1 
while ci, ~ckK, Ac™, and 
m<k» do 
begin 
= ch A Ci42 ) 
if c = ci" then delete c” 
from list L; 
m= mt+l 
end 
if m < ky then delete cX,, from 
list Li49 
end 
end 
STEP 2 | 
Store undeleted elements from lists £3, ...,L,,49 


as a new list Ly 


END 


As an example, the AND-Algorithm is applied to the determination 
of Fy 7 =F, A F, , for the DPS in Fig. 1. 
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STEP 1. 


io] 

Step 1.1 
cd =c} Act =11xxx 
cf =cl A cs = Ixxll 

L = 

3 |e} =ct Ac? =1xllx 
cf =c} Ac} =111x! 

Step 1.2 


cl Ac? # cl # cH 
cf Ac? # ec] # c} 


cf A cH =c} delete c} 


11xxx 
L3 = j1xxll 
Ixllx 


Step 1.1 


ef = 111x1 
cf = 1x11 

L = 
4 1c? = xx1l1 


Cc; = xI1xl 


Step 1.2 


xxl11 
4 Ixl1xl 


Step 1.1 


on) 


?  Ixxll 
cp Ixl1x 
cf = xxlll 
cp xi1xl 
cf = xlllx 


It can be seen that the AND - Algorithm also produces a list with 
minimal number of elements which are cubes of the largest possi- 
ble size. 


In the general case, the Boolean function corresponding to the con- 
nection Cy 7 where S={s; so ...,5,} and T = {ty ty...,t,}, can be 
obtained using the multiterminal Algorithm (m ®-Algorithm) 
described below: 


m@®-Algorithm 


STEP 1: 


Find the path identifiers for terminal connection 5; t, and store 
them in the list L, ; i. : 


STEP 2: 


Sort the path identifier in L, according to increasing number of 
symbols 1 (i.e. increasing path length); 


STEP 3: 


if i < k continue. Otherwise go to step 5 


STEP 4: 
for j = Fie 


Step 4.1 
Find the path identifiers for terminal connection 5, tj and 
store them in the list L 


Step 4.2 
Sort the path identifiers in L, according to the increasing 
number of symbols "1" 


Step 4.3 
Perform © -Algorithm on the lists L; and L4 


Step 4.4 
i < i+1; go to step 2. 


m (j, =2 if i=1, otherwise j, = 1) 


END 


In the algorithm, ® denotes OR or AND depending on the connec- 

tion type. The sorting of the lists allows faster execution of the 

algorithm (starting with the largest cubes results in earlier deletion 

of covered cubes, i.e., faster reduction of the lists during the execu- 
tion of Step 4.3). 


Based on the previous results we can propose the following algo- 
rithm for multiterminal reliability analysis: 


MUREL- Algorithm 


STEP 1: a 
Derive the multiterminal connection expression 
corresponding to the event which has to be analyzed. 

STEP 2: 

Determine the Boolean function corresponding to the 
multiterminal connection by repetitive application of 
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the m © - Algorithm. 

STEP 3: 
Apply the REL Aisottiin to obtain the multiterminal 
reliability expression or value. | 


Regarding the computational complexity of the MUREL-Algorithm, 
the following observations can be made: | 


i) The ® -Algorithm can be implemented using only logical 
operations which generally belong to the class of the fastest 
instructions in a computer system. 


ii) The m @ -Algorithm produces a minimal set of maximal 
cubes (i.e., minimal irredundant form of the Boolean fync- 
tion). 


iii) The REL-Algorithm is the fastest algorithm for the deter- 
mination of the reliability expression or for the reliability 
computation from the set of cubes (path identifiers). 


From the above considerations we conclude that the proposed algo- 
rithm can be applied to DPS of significantly larger size than was 
possible with other existing techniques. 


In the following section, the algorithm is illustrated with an applica- 
tion to a small distributed system. 


5. Example of Application of the Algorithm 


As an example of the application of the algorithm we compute the 
survivability index for the simple DPS system shown in Figure 5 
(the example is taken from [MER 80]). Assignment of files and 
programs to nodes.is shown in figure 5. __ 


[NODEx, | 


Figure 5. Four Node DDP 


FA denotes the set of files available at a given node, FN; denotes 
the files needed to execute program PM, and PMS designates tle 
set of programs to be executed at that node. 


Let us assume that for a given application, we are interested in the 
survivability of program PM3. Likewise, for another application, F 
we need both programs PM, and PMg to be operational. We m3 
separately analyze these two cases using aS a measure for surviva- 
bility the multiterminal reliability (probability of program execu- 
tion). The two problems can be stated as follows: 


Applying the OR - algorithm on Fon, and Fy 1, we obtain 


L11xl1lxx 
L1xllxlx 
Applying the REL - Algorithm on F,,3 we obtain 


Given: node and link reliability, and file and program assign- P43 = P1P2P3P4PsP6 + P\P>P4PsP7(1 — P3P6) 
ments to nodes. 


Assuming p; = .95 V i, we have: Py_3_ = -85 
Find: The survivability of: 
1) Program PM, 
5.2. Survivability of both PM; and PM, 
2) Both programs PM; and PM. Sous 
. The survivability of PM is equal to the multiterminal reliability of 
connection 
Cig HC 
Survivability of PZ; m8 4,13 
where /; = {1,3}. The connection C4 ;, is equal to 
The survivability PM 3 is equal to the multiterminal reliability of 


connection C41; = C4) AND C43 


C3 = C27, OR Cy 7, ee ae . 
Paths and corresponding path identifiers for the connections 
where J; = {1,3} and J) = {1,4} The connections C2), and Cy;, C4, and C43 are shown in Figure 7. 
are equal to 


i 


Coy, = Co 1 AND C4 3 


Cay 
Cy 7, = C2, AND Cy 4 paths F4y 
X4X7X1 Lxx1xx1x 
Paths and corresponding path identifiers for the connections X4Xgx3X6X] Ixl1x1x1 
Cy 1. C23 and Cy 4 are shown in Figure 6. 2 
4,3 
paths F443 
X4XgX3 xxl 1xxxl 
X4X7X1X6X3 Ixllxllx 
C21 
ths F ‘ 
es - Figure 7. Paths and Path identifiers for Connections 
X1X5X9 11xx1xxx C4 and C4 3 
ee ea Renae ea ee ee ree ee 
C23 
paths F, 3 ; 
D - Algorithm on Fy, , and F4 3 we obtain 
Kp XsXqX6X3 isd tex Applying the AN gorl a ; 41 4.3 
Xp X5X7X7X4XQ Lixl1lxll lxl ixxll 
Ixllxllx 
C24 IxI1 xxl 
paths Fy 4 
X1X5X7X7Xq LixI1x1x 
Xj X5XX6XzXQX4 111111x1 


Applying the AND - Algorithm on F,,3 and F,,.3 we obtain 


Figure 6. Path and Path Identifiers Representing Connections L111 11x 
Cy 1 Cy 3. and Cy 4 1i1111xl 
L11l11ixil 


Applying the AND - Algorithm on F,, and F) 3, and Fy, and Fy 4 
we obtain 


Applying the REL - Algorithm on F,, we obtain 


Pin = P\P2P3P4PSP6P7 + P\P2P3P4P5P697P8 + P\P2P3P4P596P7P8 
Assuming p;=0.95 Wi, we have: p,, = 0.778 


Foy, Fo 1, 
L11xllxx -  LIxllxlx 
lixllxll lilllixl 
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6. Conclusion 


In the paper, the multiterminal reliability is introduced as a meas- 
ure of DPS survivability and the MUREL-Algorithm for multiter- 
minal reliability analysis of DPS.is proposed. First, the event under 
study is expressed in terms of its multiterminal connection. Then 
the m @ -Algorithm is used to translate the multiterminal connec- 
tion into a Boolean function involving all the relevant system com- 
ponents. Finally, the multiterminal reliability is obtained from the 
Boolean function by application of the REL-Algorithm. 


Preliminary computational complexity considerations show that the 
MUREL-Algorithm permits the survivability analysis of DPS of 
considerably larger size than using currently available techniques. 
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Abstract -- This paper discusses the problem 
of blocking in open exponential queueing networks. 
It is pointed out that such networks can be viewed 
as queueing network models of message-switched 
data communication networks with local flow- or 
congestion-control. Analysis is done by perform- 
ing a node-by-node decomposition, and it is argued 
that an "off-line" analysis can be made, where the 
main problem is to analyze a single-server finite 
capacity queueing system with Markovian arrivals 
and a Coxian service time distribution. The 
method is applied to a number of example networks 
and evaluated by comparing the results obtained 
with those results obtained through exact analysis, 
simulation, or other approximate methods. We find 
that the method provides a good approximation 
procedure for obtaining system performance measures 
such as blocking-probabilities, throughput rates, 
etc. 


Introduction 


An open queueing network is a collection of 
nodes or servers that offer some form of service 
to customers in the network. A customer may enter 
the network at some node, receive service, and 
then immediately go to another node for additional 
service or he may leave the network. At any given 
time the number of customers in the network is a 
stochastic variable. In this paper we concentrate 
on exponential queueing networks where at any time 
the number of customers in a node may not exceed a 
certain number. This implies that customers 
currently not in that node and who want to go to 
that node are prohibited from doing so and will be 
held in their current nodes until the congestion 
is resolved. The interest in such a queueing net- 
work model was generated by an interest in gaining 
a better knowledge with regard to the influence of 
local flow control in a message- or packet-switched 
data communication network. Therefore, we prefer 
to present the detailed queueing network model 
through the terminology of data communication net- 
works. 


Flow control] or congestion control in data 
communication networks are protocols that regulate 
the traffic flow input to the network or a switch 
node. The reason for introducing such control 
mechanisms is to try to minimize the impact of 
possible congestion and overflow due to the conten- 
tion of a smaller number of resources by a large 
number of users [GERL 80]. Often flow control 
strategies are characterized as global control and 
local control. The global control refers to a 
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control of the number of messages currently out- 
Standing in the network between end users. The 
local control refers to a limit placed upon the 
number of messages currently residing in one node 
of the network. The impact of global flow control 
is fairly well understood, and there exists a 
number of excellent publications dealing with this 
subject, see the references in [GERL 80]. The 
local control strategy is much more difficult to 
analyze and very few results can be found in the 
Some very useful results can 
however be found in [PENN 75]. We intend to 
present an approximate method that allows us to 
better understand the impact of a local flow 
control mechanism. For simplicity we will assume 
a network operating with a fixed routing algo- 
rithm. Consequently, it is possible to identify 
a number of fixed source-destination paths in the 
network. We will analyze a path consisting of M 
nodes that the messages have to pass through from 
the source to the destination. We will assume 
that the local control allows a maximum number N. 
of messages in node i of our path. 


The local control implies that a message 
arriving to the head of the queue in node i when 
there are Niwy messages in node itl, i.e., node 


i+] is filled to its capacity, is blocked and has 
to wait until one of the messages in node i+] is 
transmitted. When the blocking is resolved, the 
message can be transmitted immediately. In order 
to evaluate such a scheme, we need to analyze a 
queueing network model consisting of M finite 
capacity queues in tandem. 


The tandem network of finite capacity queues 
is extremely difficult to analyze except for 
certain trivial cases. It is however always 
possible to use a numerical procedure in order to 
find interesting quantities. This is accomplished 
by generating a Markovian system possibly by 
approximating the arrival process and service 
process by Coxian processes [KLEI 75]. A Coxian 
arrival process is a stochastic process where the 
interarrival time distribution is Coxian and 
successive interarrival times are independent. 


The numerical method has certain advantages, 
but it very quickly becomes impractical if the 
number of states in the Markov chain is large. 


A purely analytical approach is very diffi- 
cult again due to the fact that the. dimensional ity 
of the state-space is often too large. 


Consequently, an approximate method that 
allows us to obtain almost accurate results for 
the steady-state probabilities and associated 
quantities such as network throughput and message 
delay seems to be a viable alternative. 


Approximate analyses of exponential queueing 
networks of the type we are interested in -have 
appeared in the existing literature. The classi- 
cal paper by Hunt [HUNT 56] provides the first 
introduction to this difficult problem. More 
recent papers are those by Hillier, et. al., 

[HILL 67] that concentrates on finding the network 
throughput, and the paper by Takahashi, et. al., 
[TAKA 80] in which an approximative method based 
upon an M/M/1 queueing model with adjusted arrival 
rates and effective service rates is given. In 
the next section we will present a method that we 
believe and also show to be better and more effec- 
tive than other existing methods. 


Approximate Analysis 


Our queueing model of the logical link with 
local control is a tandem network with finite 
capacity queues. The service time distribution in 
the i:th node is exponential with parameter Wa and 


the arrival process to the queueing network is 
Poissonian with rate }. The capacity in the i:th 
node is N. messages; included in this is the 


message currently under transmission if any and 
the messages waiting to be transmitted. We also 
assume that messages arriving to node 1 when the 
node is filled to its capacity are lost from the 
system and the last node, node M, cannot have any 
blocked messages. 


Let ne = the number of messages in node k and 
define a, = Piny = NL} as the "blocking" proba- 
bility for the k:th node. Clearly if ay is known, 


the total throughput for the tandem link is 
ACT = ay) Since we do not ailow messages to be 


lost or destroyed once they have been given access 
to the tandem link. This being the case, it 
follows that the throughput for each finite queue 
in the tandem is A(1 - ay). 


' The idea behind our approximation is to 
decouple the tandem network into M individual 
queues with arrival rates and service times given 
such that the analyses of the individual queues, 
an off-line analysis, will give relatively accu- 
rate results for the total queueing model. The 
approach we follow is to define a new service time 
distribution and an effective arrival rate. 


The service time distribution for the i:th 
node is found by observing that as long as the 
following node is not filled to capacity the 
service time is exponential with parameter Uae If 


the i+]:st node is filled to capacity, the effec- 
tive service time of the message at the head of 

the queue in node i is taken to be the sum of two 
independent exponential random variables with rates 
aay and rF respectively. The probability for the 
latter event is avat: 
tion can thus be represented as a two-stage Coxian 
distribution, see Figure 1. 


The service time distribu- 
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Figure 1: Service Mechanism for the i:th Node 


In the above service time representation we 
have ignored the cases where subsequent nodes ,_ 
j.e., nodes i+2, i+t3, also are filled to capacity. 


The arrival rate to the i:th node in the off- 
line analysis is set to 7 : 


M1 = ay) / (= ay). q) 


1 


The i:th node is a finite capacity Single-server 
queueing system and according to our assumptions 
a fraction 1 - a. of the arriving messages will 


be served by this queue. By using the arrival 

rate as defined in (1) we ensure that the through- 
put obtained by the off-line analysis of the node 
is the correct one, namely A(1 - a,). We approxi- 


mate the arrival process to the node with a 
Poisson process with the correct arrival rate. 
Each off-line single-server queue is now analyzed 
as an M/COX2/1 finite capacity queue. 


An M/COXK/1 finite capacity queue can be 
analyzed by defining a Markov chain [MARI 80] with 
a state-space given by the number of customers, n, 
in the queue and the service stage j in which the 
customer in service, if any, is currently 
residing. The steady-state probability of this 
event is denoted by p(n,j) and the following 
balance equation can easily be written down. 


K 
(I-rs)u; p(n,j) = a 2) p(n-1,J) (2) 


et as 
—_ 


j 
where uy is the exponential service rate in the 
j:th service stage, and rj the Coxian branching 
probability. K is the number of stages in the 
Coxian distribution. 


The conditional throughput of an M/COX-K/1 
finite capacity system can be expressed as 
[MARI 80] 


v(n) = JFI 


Ko, 24 , 
} (-rs)u, p(n,j) 
p(n) 


where 


‘ | 
p(n) = L, p(n,j) (4) 


provided that n # 0. Using (3) the balance equa- 
tion can be written as © 


v(n)p(n) = Ap(n-1) . (5) 


For a two-stage Coxian distribution the condi- 
tional throughput v(n) can be determined 
recursively by the following formulas [MARI 80]: 


h vy (1-r, )tuyu 
tie ee l1<n«<wN (6) 
Atu) tuy-v(n-1) 


(ij tase kee, qe (7) 


Hy tHy-v(N-1) 
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where N is the system capacity. Using (6), (7), 
and (8) we easily find p(n). 


In our off-line analysis every node except 
the last node is modelled as an M/COX-2/1 finite 
capacity queue. The last node is modelled as an 
M/M/1 finite capacity queue. This last queue is 
easily analyzed, and we find that 


N 


= P{n sq ee (9) 
any MOM Nyt 
l-p 
where 
A(1-a,) 
Paya ai 


The blocking probabilities Ay> Ans ves ay 


are not known but can be computed iteratively by 
using the following observation. If asa) is 


known, a. can be computed, for i = M-1, M-2, ...,l. 


ayy is by our construction a function of a, and by 
choosing an initial value for ay the blocking 


probabilities can be computed iteratively to any 
desired accuracy. 


Numerical Examples 


In this section we present numerical examples 
in which we compare the results by the approximate 
method to exact results if such are available or 
to other approximate methods or to results 
obtained by simulation. 


In the first simple example we consider a 
tandem network consisting of two exponential 
servers each with the same service rate u and each 
with the same finite capacity N. = 1, Figure 2. 


The exact results for the blocking probabilities 
are easily obtained by solving the Markov chain 
problem that can be formulated. In Table I we 
compare the exact and approximate blocking proba- 
bilities for different values of the ratio d/uz. 
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Figure 2: Network For Example 1 


We note that the blocking probabilities in 
node | are consistently overestimated and the 
blocking probabilities in node 2 underestimated. 
Subsequent examples will show that this is always 
the case. The reason for this behavior is that we 
approximate the arrival process to the nodes with 
a Poisson process. This is not a serious problem, 
Since it implies that the approximative method 
gives a lower bound for the network throughput. 


Table I: Comparison of the Exact and Approximate 


Results for the Blocking Probabilities 
in Example 1 
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Approximate Approximate 
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In our second example we investigate a three- 
node tandem queue where the first server is 
always kept busy, 7.e€., an overload situation. 
Obviously of interest here is to find the blocking 
probabilities of the second and third queue. In 
order to be able to make a comparison between 
exact results obtained from a Markovian analysis 
and the approximate method, we again select a 
fairly simple system with node capacities equal 
to one message, Figure 3. 
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Figure 3: Tandem Network For Example 2 


In Table II we show how the exact and approximate 
methods compare for different values of p> Hos 
and U3: 


Table II: Comparison of the Exact and Approximate 
Results for Blocking Probabilities in 


Example 2 


arenears 


2 
IeZ le3 


Blocking 


Probabilities | Exact | Approximate 


1.1 
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With minor modifications the approximate 
method can also be used in a network with random 
routing. In order to test the robustness of our 
method, we have used it on a simple sample net- 
work, see Figure 4. 
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Figure 4: 


Three-node Network for Example 3 


After completion of service at node 1, a 
message is routed to node 2 with probability 49 


Table III: 


independent of the current state of the network 
and with probability a4 to node 3, (a4 ta 371). 


The reason for choosing this network is that 
it was used in [TAKA 80] to illustrate another 
approximate method for analyzing queueing networks 
with blocking. In Table III we compare the 
results for the blocking probability at node 1 
obtained by an exact Markovian analysis, the 
approximate analysis due to Takahashi, et. al., 
and the method presented in this paper. 


Our fourth and final example brings us to a 
data communication network with local control and 
also external traffic imposed on the tandem link. 
In order to be able to compare our results with 
others we choose exactly the same configuration 
as the one chosen in [PENN 75]. We assume 
accordingly that the external traffic is only 
allowed to use one server in the tandem network 
and then leave the tandem, see Figure 5 
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Queueing Network Model of a Path 
According to the Model Used in 
[PENN 75] 


Figure 5: 


We use a similar approach as in [PENN 75] to 
account for the external traffic. We do however 
use our method for computing the blocking proba- 
bilities and the loading factor. The loading 
factor is by definition in [PENN 75] the frac- 
tional increase in queueing time suffered by 
external messages due to the presence of link 


Comparison Between Exact and Approximate Methods for Computing the 


Blocking Probability (PB, ) at node 1 (a45 = O19 = 0.5) 


Service 
Rates 
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Takahashi | Our 
Exact Approximate Method 


messages averaged over all external messages. The 
effect of local control as a function of the node 
capacities can be shown by plotting the loading 
factor as a function of blocking probability as in 
Figure 9, in [PENN 75]. We show in Figure 6 the 
loading versus blocking probability. If we had 
superimposed the curve presented in [PENN 75] in 
our diagram, the result would have been an almost 
overlap. Due to the fact that we do not have 
access to simulation data for this example, we 
cannot make a judgement about how accurate the 
method is. The previous examples have however 
Shown that the method presented in this paper is 
more accurate than other approximate methods. 


Conclusion 


We have presented an approximate method for 
the analysis of open exponential queueing networks 
with finite capacity. It has been demonstrated 
through several examples that the method produces 
results that are quite accurate. In particular 
we showed that the impact of local flow control in 
message-switched data communication networks can 
be analyzed by this method. We have in this paper 
constrained ourselves to open exponential queueing 
networks. The results obtained are certainly use- 
ful, but we would like to be able to extend them 
to more general networks. This is possible as 
long as the arrival processes can be modelled with 


LOADING 


Ny No Ng = (5,5,5) 
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Coxian interarrival times and the service time 
distribution with a Coxian distribution. Note 
that the Poisson arrival process is a special case 
of a renewal input process with Coxian inter- 
arrival time distribution. The resulting "off- 
line" queueing system can then be modelled as a 
finite capacity queue with Coxian input and out- 
put. Further work in this area is currently being 
carried out. The main obstacle, however, is to 
check the accuracy of the result, since no exact 
results seem to be available and simulation 
results are scarce. 
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ABSTRACT partitions which execute independently 
except for interactions through specified 
communication and data channels. 


Reconfigurable array computer architectures 3. the ability to implement a wide spectrum 
give the application designer power to define an. of interconnection geometry. 
execution architecture or architectures and _ an 
interaction geometry © appropriate to the The availability of such reconfigurable 
computational architecture of the algorithm under architectures opens new dimensions for’ the 
consideration. Accurate estimation of execution formulation of parallel algorithms. The 
times for reconfigurable architectures requires arrangement of computations can be based upon the 
determination of appropriate computational structure of the algorithm rather than upon a 
structures for the algorithm and analysis of the specific available - architecture. Resource 
cost of interprocess data movement, partitions can be tailored to the computation and 
synchronization delays and reconfiguration faults data movement requirement of the algorithm. The 
as well as actual execution time for the algorithm importance of problem specific interconnection 
in the architecture selected. This paper reports geometry is noted by Gentleman [GEN78]. He 
such a formulation of an algorithm whose demonstrates that fixed geometries can easily lead 
instruction count has previously been well to data movements dominating execution time for 
characterized, even/odd elimination of block matrix multiplication and matrix inversion. 
tridiagonal linear systems. The algorithm Problems can be formulated as sets of tasks 
naturally decomposes into three steps each of or sequences of sets of tasks (MIMD/SIMD mode of 
which requires a different computational structure computation) rather than merely sequences of tasks 
and displays a different natural degree of as is the case on the uni-processor. Each task 
parallelism. It gives a speed up linear in the set may have a different degree of parallelism 
number of processors when degrees of parallelism and/or a different interconnection geometry. 
appropriate to each step are employed. Data The execution time of an algorithm on a 
movement , synchronization and reconfiguration paraiieél computational structure depends not only 
fault costs are found to be about 10% of the upon the operation count of the computation, but 
computation costs. also the time required for data movement and the 


time lost to synchronization delays. For an 
algorithm or process with disjoint phases which 
require different computational structures to give 


1.0 INTRODUCTION an optimal execution time for each phase, then the 
time for reconfiguration of the architecture must 
The practical formulation of parallel be included in the total execution time. This 
algorithms is limited by the interconnection paper defines an algorithm for’ the odd/even 
geometry of the multi-processor architecture which elimination of block tridiagonal systems on a 
is to host the computation. Any fixed geometry of reconfigurable array computer. The algorithm is 
processor interconnection limits the class of resolved into three distinct steps, each of which 
algorithms which can be implemented. A full cross uses a different degree of parallelism and has a 
bar network removes all restrictions on algorithm different interconnection geometry. This 
formulation, but is prohibitively expensive for formulation displays advantages for the use of 
even a moderately large number of processing reconfigurable array systems with SIMD computers 
elements. A common memory architecture will assigned to blocks for the odd-even algorithm. 
suffer performance degradation from interference These are 
as the number of processors becomes large. The — 1. Each SIMD machine operates independently, 
solution to this dilemma is being sought with the therefore independent pivoting is 
development of reconfigurable interconnection possible. | 
networks to link arrays of processing elements 2. Each SIMD machine can be tailored to the 
(processors and memories). A variety of size of the block it is handling so that 
reconfigurable network architectures have been synchronization waits are minimized. 
proposed [LIP77,SIE79]. The common elements of 3. The synchronized nature of the shared 
these interconnection networks include: | data access is well suited to 
1. implementation costs which grow at a rate : intercommunication “8 mechanisms 
of n log n where n is the number of characteristic of reconfigurable 
elements to be connected [GOK73]. computers. : 


2. the ability to establish resource 
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2.0 RECONFIGURATION 


This section defines and describes the 
concepts of an MIMD/SIMD execution mode for 
reconfigurable arrays of processing elements and 


describes the modes of data movements’ which 
characterize reconfigurable network architecture 
computer systems. 


2.1 MIMD/SIMD 

Reconfigurable 
implemented as 
memory modules 
network. This 
quite different 
reconfiguration 


computers are generally 

arrays of processor modules and 
with a modular’ interconnection 
definition of reconfiguration is 
from the instruction set level 
as in the Burrough's B1700 [ORG78, 
RAU78]. The interconnection network can establish 
resource partitions consisting of a_ subset of 
processor modules and memory modules. A partition 
can be configured to implement SISD or SIMD modes 
of operation (figure 1). A job then consists of a 
number of partitions (a task is a partition) 
interconnected according to the data flow of the 
job. Processors within a partition are under 
lock-step control of one instruction stream; 
processors from different partitions are under the 
control of different instruction streams. 


2.2 Communication And Synchronization 

Two distinct kinds of data transfer 
requirements arise from the MIMD/SIMD mode of 
operation. Data transfers between partitions are 
needed both with a computation structure and 
between the structures of different phases or 
stages. A synchronization mechanism is needed to 
control the data transfers. Additionally SIMD 
machines implemented as arrays of processor 
modules must deal with the problem of data 
alignment. Consider as an illustration row and 
column access of matrices. In a pipeline vector 
processor, e.g. [CHIN72,WAT72,RUS78], it is 
possible to organize data in interleaved memories 
so that row and column access can be performed 
equally efficiently. The ILLIAC IV [BAR68] on the 
other hand is considerably harder to program for 
Simultaneous row and column access for two 
reasons: 

1. the physically distributed nature of the 
sources and targets of data. 
2. a static and limited linkage network. 

The Burroughs BSP [BUR77] avoids this problem 
by the brute force solution of using a cross point 


interconnection network between processor and 
memory modules. 

We give here an outline of the pertinent 
features of the communication subsystem 
[PRE79 ,SEJ81] in TRAC, a representative 
reconfigurable computer and show how they provide 


capability for the movement of data both within 
and between SIMD partitions. 
TRAC provides two kinds of physical channels 


for communication: packets and shared memories. 
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task address space to 


Packets are continuous streams of bytes and are 
memory—to-memory transfers. A data vector and a 
realignment vector are specified. Packet movement 
through the network is used to create a realigned 
result vector. | 

The concept of shared memory in resource 
partitioned architectures such as TRAC is not the 
same as in multiple processors sharing access to a 
common physical memory address space such as C.MMP 
[WUL72]. In these latter architectures sharing is 
on a cycle by cycle basis with a possibility of 
interference when more than one processor 
endeavors to access a given memory module. 
Synchronization mechanisms for access to memory 
are commonly implemented in software or firmware. 
The execution of these sychronization mechanisms 
consumes memory bandwidth and themselves interfere 
with the performance of the sharing resources 
(e.g. the spin-lock mechanism as described in 
[OLE78]). In reconfigurable network architectures 
such as TRAC, sharing may be combined with 
synchronization by altering the interconnection 
network to move a physical memory module from one 
a different task address 
Space. TRAC accomplishes this extended sharing 
concept by a hardware configuration with a_ shared 
module at the root of a tree whose leaves are 
processors (figure 2). To obtain a shared memory 
a processor must execute an ACQUIRE instruction. 
The processor blocks if the module has already 
been acquired by another processor. Retry effects 
only the acquisition circuitry, not access by 
other processors. 


3.0 BLOCK TRIDIAGONAL SYSTEMS 


Block tridiagonal systems of linear equations 
occur frequently in scientific computations, often 
forming the core of more complicated problems. 
Numerical methods for the solution of such systems 
are well understood and techniques tailored to the 
solution of such systems on pipelined 
supercomputers have been studied extensively 
[TRA76,MAD75 ,HEL77 ]. 

The linear system is represented as Ax=v with 


I b(1) ¢(1) i 
I a(2) b(2) e(2) I 
I a(3) b¢3) e(3) I 
I beKws I 
A = Cs eo ree rc 
I ere I 
Af a(N-1) b(N-1) c(N-1) I 
I a(N) b(N) I 
= (a(j) ,bC 5) eC j)) 
N 
where b(i) is an. x n., matrix and a(1) = O and 
| i i 
e(N) = 0. 
The odd/even elimination method (and the 


odd/even reduction method which can be regarded as 


a compact version of the former) is widely 
regarded as an efficient direct method for the 
case where the n, x n. blocks are small enough to 
be stored explicitly tHEL77]. | 

Consider odd/even elimination as described in 
Heller [HEL77], section 4: pick three consecutive 
block equations from Ax = v, A = (a(j),b(j),e(j))N 


a(k-1)x(k-2) + b(k-1)x(k-1) + c(k-1)x(k) 


a v(k-1) 
iadynenianienbeteaataeresasestkel): -- 
a(k)x(k-1) + b(k)x(k) +c(k)x(k+1) = v(k) 

Ks estasgt lena a gear en ene a a eles ranean wea eats wea) 
a(k+1)x(k) + b(k+1)x(k+1) + c(k+1)x(k+2) = v(k+1) 


Mee ieee eee Weasels eee setae eke) 


If we multiply equation k=1 by -a(k)b7 '(k-1), 
equation k+1 by -c(k)b (k+1), and add, the result 


is; 
4 | | 
(-a(k)b (k-1)a(k-1))x(k-2) 
-1 —1 
+ (b (k) = a(k)b (k-1)c(k-1) 
-1 
-~ e(k)b (k+1)a(k+1)) x(k) 
| a 
+ (-c(k)b (k+1)0e(k+1))x(k+2) 
—-1 
= (v(k) = a(k)b (k-1)v(k-1) 
-1 
-c(k)b (k+1)v(k+1)). 

For k=1 or.N there are only two equations 
involved and the modifications should be obvious. 
This operation eliminates the odd unknowns ffor k 
even and the even unknowns for k_ odd. By 
collecting the new equations into the block 


pentadiagonal system H.2x =v.2, (with A defined as 
H.1) it is seen that row elimination has preserved 
the fact that the matrix has only three non zero 
block diagonals, but they are further apart. A 
Similar set of operations is applied combining 
equations k-2, k and k+2 in H.2 to produce 
H.3x=v.3 system. This process is repeated until 
only one block diagonal remains (or in the case of 
semi direct methods, some accuracy criteria are 
fulfilled). The initial coefficient matrix H.1 
contains 3N-c non zero blocks while the final 
matrix consists of N non zero blocks along. the 
main diagonal. 


Solving the N blocks independently gives the 
required solution. 
Figure 3 shows the effect of 5 steps of 


elimination on a 16x16 block tridiagonal system. 


4.0 DATAFLOW AND IMPLEMENTATION 


In this section we will look at the dataflow 
characteristics of  odd/even algorithms. The 
computational aspects, such as operation counts 


are well understood; the communication geometry is 
studied herein and found to be regular and simple. 
= Computationally, instead of determining 
b (i) explicitly, LU factorization of b(i) is 
generally resorted to: _ 

compute LU factors of b(k), (1<=k<=N) 
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between 


solve b(k) [ a(k) c(k) v(k)] = | 
[ a(k) e(k) v(k) ] , 1<=k<=N 
b(k).2 <-— b(k).1 - a(k).1 c(k-1) - c(k).1 a(k+1) 
pee cecccees IMEK<KEN . 
v(k).2 <-- v(k).1 - a(k).1 v(k-1) - c(k).1 v(k+1) 
basse Savas ISSRCSN | 
a(k).2 <-— a(k).1 a(k-1) , 3<=k<=N 
e(k).2 <-- o(k).1 o(k+1) , 1<=k<=N~2 


4,1 Intertask Dataflow 

The sequence of actions that results in the 
computation of H.i+1 from H.i is referred to as a 
stage: in this case each stage is shown to 
consist of three steps and the steps further 
consist of substeps. 

Consider the input dataflow for computing H.2 
and v.2. In the first step, the first substep 
results in the LU factorization of b(k) ; this is 
then used in the next substep for computing a(k), 
e(k) and v(k). N way parallelism is displayed in 
this step. | 

In the second step the computation 
a(k).i+land c(k).i+1 requires a(k).i, a(k-1).i 
e(k).i, c(k+1).i respectively; b(k) .i+1 
v(k).i+1 require data from the (k-1), (k) and 
(k+1)th row equations. Binary operations are 
performed on the blocks - pairwise access to 
blocks is sufficent giving rise to upto (N/2) way 
parallelism. | 

Figure 4a shows the interconnection geometry 
needed for the second step for an 8x8 system. The 
blocks are stored in separately accessible shared 
memories one block row per shared memory. The 
diagram to the rignt of tne 6x8 tridiagonal system 
is the inter connection pattern with circles 
representing processors and Squares shared 
memories. The edges represent potential links 
that are activated as 4 separate patterns as shown 
further to the right. The new blocks computed at 
the end of substep 2 are shown in curly brackets 
the patterns of substep 2 and 3 ; the 
remaining new blocks are completed at the end of 
substep 4. 

The crucial observation here is that while 
the datasets are shared across processors, the 
sharing is conflict free within a _ substep. The 
connection pattern cycles through the states of a 
2-pole 3 position switch. 

H.2 is a pentadiagonal matrix - the 
application of an inverse perfect shuffle [STO71] 
partitions this matrix into two tridiagonal 
matrices, one consisting of the odd numbered 
coefficient blocks and the other of even numbered 
ones (Figure 4b). 

If the matrix A contains N=2**m blocks’ then 
the dataflow geometry for the next step 2 is 
represented by a graph that is a proper subset of 
the graph used in the earlier step 2 (Figure 5). 
This inclusion property is seen in every step 2 
until the block diagonal is computed. 

Thus we use the precisely same dataflow 
template in the generation of every H.i+1 from H.i 
; three steps with different connection geometries 
are needed — a direct connection, a 2-pole 


of 
and 
and 


3-position switch based connection followed by 
inverse perfect shuffle. 

A number of proposed reconfigurable computers 
can implement these and other communication 
behavior quite efficiently. Note that if we were 
tto hardwire the interconnect pattern we would be 
using a special purpose machine of limited 
applicability to other problems (Eg. the shuffle 
exchange network [STO71] in high performance FFT 
boxes). 

An implementation based on the use of shared 
memory in TRAC is now sketched briefly. The 
processors are scheduled as SIMD partitions with 
width commensurate with the block size under 
consideration. The shared datasets are stored in 
shared memory modules — each circle of figure 4 is 
realized as an array of shared memories of width 
conformal with the processor width. The time to 
Switch the 2-pole 3-position based switch is a 
critical parameter in the performance of the 
algorithm. On the basis of a 10 microsecond 
acquire time for an unacquired module 
parameter is estimated at between 50 = and 
microseconds for TRAC. : 


an 
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4.2 Performance Estimation 


The mode of operation described in the 
previous subsection consists of asynchronously 
executing processes which synchronize periodically 
to transfer data. Operations on different blocks 
may require different computation times. There 
may be, for example, different search times for 
the choice of pivot rows. Thus for a performance 
model to accurately represent this kind of 
behavior, it must be based on non deterministic 
time parameters. 

We will now present a niave analysis based on 
average time parameters as a first step towards 
developing a performance model of reconfigurable 
computer operation. 

The time for data movement depends upon the 
width of the processors and the width of the 
arithmetic. We choose a definite configuration to 
illustrate the magnitude of the communication 
costs. A 16 bit wide SISD partition will be 
assigned to each block. (If blocks are of uneven 
size a more powerful partition could be assigned 
to larger blocks.) Arithmetic will be on floating 
point numbers with 64-bit mantissas and 68-bit 
exponents. 

Let the system be 16x16 blocks and each block 
be 8x8 (total matrix dimension 128x128). Each 8x8 
block requires about 600 words of storage. A 
block row consisting of three 8x8 non zero blocks 


and a 8x1 vector requires about 2000 words of 
storage. 
The following notation is used for 


representing operation times: 
T.fpa: floating point addition 
T.fpm: floating point multiply/divide 
T.xfer: memory to memory transfer time for one 
word. 
acquisition and setup time to obtain 
shared memory. , 

The evaluation of H.5 from H.1 proceeds in 4 
stages with each stage evaluating H.i+1 form H.i. 


T.Swi: 


this. 


ye. 


The first step of a stage consists of the LU 
factorization of b which is used to evaluate a,c 
and v. The second step consists of three substeps 
that correspond to the three positions of the 
2-pole, 3-position switch. The final step 
performs the inverse perfect shuffle to position 
data for the next stage. From the discussion of 
the previous subsection it is evident that the 
first and last step display 16 way parallelism and 
the second step 8 way parallelism. 

H.5 is finally solved as 16 uncoupled linear 


systems to obtain the required solution, 

We will use the notation Pi(k,l,m..) to 
represent the State where partition Pi is 
connected to datasets k,l,m.. and dataset k 
contains a(k),b(k),c(k) and v(k). This is the 


timing for the implementation with shared memory. 
The results of this analysis are discussed at the 
end of the section and can be directly skipped to 
without loss of continuity. 


STAGE 1:Compute H.2 from H.1 


otep 1:3 
Configuration: Pa (i), Pb(2),.... 
Setup time «~. T.swi. 


» Ppt 16) 


Substep 1.1 


Computation: Pa: b(1) (LU decomposition) 
Pb¢ b(2) 
Pp: —b( 16) 


Compute time 200*T.fpm + 200*T.fpa 


Substep 1.2 


Computation: Pa: a(1), e(1), v1) 
Pb: a(2), e(2), v2) 
Pont a(16), (16), v(16) 


Compute time ~ 1200*T.fpm + 1200*T.fpa 


Step 2 
Substep 2.1 
Configuration: Pa(1), Pe(2,3), Pe(4,5) 
| uses, POCIN S15) 


Setup Time~ 2*T.swi 


Substep 2.2 
Configuration: Pa(1,2), Pe(3,4), Pe(5,6), 
SoG eead~e FOCI15< 16) 


Setup time ~ 2*T.swi. 


Computation: Pa: BCha2; alti: 
e(1).2, v(1).2 

Pe: b(3).2, a(3).2, 

e(3).2, v(3).2 


Compute time ~ 2000*T.fpm + 2000*T.fpa 


Substep 2.3 

Configuration: Pa(1,2), Pe(3,4), Pe(5,6) 
bieee ey POC IDs 16) 

Setup time ~ 0*T.swi 


Substep 2.4 
Configuration: Pa(2,3), Pe(4,5), Pe(6,7), 


cere tates FOCIO) 


Setup Time: w 2*T.swi 
Computation: Pa: b(2).2, a(2).2, 
e(2).2, v(2).2 
Pe: b(4).2, a(4).2, 
e(4).2, v(4).2 


Compute Time~ 2000*T.fpm + 2000*T.fpa 
Step 3 


Substep 3.1 


Configuration: Pa(1), Pb(2), Pe(3), Pd(4), 
Pe(5), Pf(6), Pg(7), Ph(8), 
Pi(9), Pj(10), Pk(11), P1l(12), 
Pm( 13), Pn(14), Po(15), Pp(16) 

setup time ~ T.swi 

Computation: Transfer source dataset contents 
to local buffer. 

Compute time ~ 2000*T.xfer 

Substep 3.2 

Configuration: Pa(1), Pb(9), Pe(2), Pd(10), 
Pe(3), Pf(11), Pg(4), Ph(12), 
Pi(5), Pj(13), Pk(6), Pp (16), 
P1C 14), Pm(7), Pn(15), Po(8) 


setup time ~ T.swi 
Computation; Copy local buffer contents to 
target dataset. 7 
Compute time~ 2000*T.xfer 


The important concern is the ratio of direct 
computation time to the sum of the total non 
computation time ( this consists of the various 
T.swi, T.xfer, synchronization times etc. ). Let 
us make reasonable assumption that the ratio of 
execution time for T.xfer|T.fpajT.fpmiT.swi are 
1;10;100;1000 and let T.xfer be one microsecond. 
Estimate the set-up time for T.fpa,T.fpm and 
T.xfer be equal to the arithmetic execution time. 

For the shared memory implementation the 
total direct computation time is 1034 milliseconds 
(ms), total reconfiguration time is 9 ms and data 
transfer time is 8 ms. Thus, if synchronization 
time is zero, then the overhead associated with 
mapping the odd/even elimination to a parallel 
basis is about 17/1034 or about 2h. 
Synchronization delays result solely from the 
differences in processing time for each block. 
For uniform size blocks, processing time 
differences between blocks’ will result from 
differing effort for pivot selection. This should 
not exceed 1%. Reconfigurable architectures can 
assign processing partitions with power 
proportional to block size. (SISD partitions with 
a factor of 8 variation in power for 64-bit 
floating point numbers can be constructed on 
TRAC.) Thus synchronization delays should be not 
more than 10% of direct execution time. Using 
this as an upper bound the total overhead cost in 
this formulation is approximately 12%. 


96, 


4.3 Intratask Dataflow 


of an SISD partition for each block 

problem of alternate row/column 
Row and column accessing is necessary 
because the block matrices a(k) and c(k) are 
involved in both pre multiplication and post 
multiplication. The use of SIMD partitions would 
introduce efficiences in the computational part of 
the formulation. Data distribution, would however 
become more complex. Packet movement would be 
used to realign data between  pre- and 
post-multiplication stages. This problem will be 
approached in a subsequent publication. 


The use 
avoids the 
addressing. 


5.0 CONCLUSIONS 


Reconfigurable array computers have been 
proposed as a candidate architecture for the VLSI 
implementation of supercomputers. A crucial 
motivation is that such machines provide a 
programmable rather than a fixed geometry 
communication subsystem. The ability to adapt 
communication geometry to the requirements of the 
algorithm is supposed to minimize 
non=-computational execution costs on parallel 
architectures. A parallel formulation of odd/even 
elimination of block tridiagonal systems is used 
to illustrate the effectiveness of 
reconfigurability. The mechanisms of TRAC which 
are representative of such architectures are used 
in the formulation. Data movement, 
reconfiguration and synchronization costs are 
found to be small with respect to direct 
computation costs. 

The development of 
reconfigurable architectures is 
tractable. This analysis of a parallel 
formulation of odd/even algorithms is intended to 


norallal 
IY ee ese ee 


ct 
Ss 


display a paradigm for the formulation of 
algorithms on reconfigurable array computers. 
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FIGURE 1: PARTITION ON A RECONFIGURABLE ARRAY COMPUTER 
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FIGURE 3: 5 STEPS IN THE ELIMINATION OF A 16x16 SYSTEM 
(FROM HELLER [HEL77] pp. 39) 
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ON MAPPING NON-UNIFORM P.D.E. STRUCTURES AND ALGORITMHS 
ONTO UNIFORM ARRAY ARCHITECTURES. 


| by Dennis Gannon' 
 Deparan ent of Computer Sciences, Purdue University 
. West Lafayette, Indiana. 


ABSTRACT. Adaptive algorithms for solving partial 
differential equation are studied as a means of provid- 
ing improved speed-up when, in limited processor 
situations, traditional "uniform" grid parallel 
metheds are inefficient. The difficulty with these 
methods is that the non-uniform data structures may 
not be well suited to parallel architectures designed 
for array and vector problems. In this paper the 
data-flow problems associated with a class of Multi- 
Grid algorithms are studied. It is shown that, in spite 
of non-uniform grid structures, a SIMD machine with 
an 2 network connection provides a good environment 
for adaptive computation. Time estimates that 
include interprocessor communications are derived. 


1. INTRODUCTION 


While most studies in parallel computation are 


based on algorithms designed around regular data 
- arrangements like arrays and vectors, many important 


applications are more efficiently treated with some — 


form of irregular or non-uniform adaptive structures. 
A simple example is the improvements in serial 
efficiency obtained for large sparse matrix problems 
by using linked lists and list algorithms rather than 
large two dimensional arrays. A second example is 
given by adaptive methods for solving partial 
differential equations. One of the fastest parallel algo- 
rithm (Sameh, Chen, and Kuck [9]) for solving the Pois- 
son problem 

Ou | Omu_ 
Ox* , ay" = (zy) 


requires a uniform n by n grid defined ona rectangular 
domain D as shown in figure 1. 


Vu = 


1 Research supported by the National Science Founda- 
tion Grant MCS-810951¢. 


With p® processors this method requires 

O((n/p)*log(n)) time to solve for the unknown u at 
each of the n” grid points given the "data" f at each 
point. When p=n this algorithm is the best known, but 
when n is much larger than p one may wish to con- 
sider other methods. In many cases, the data f 
represents very localized activity that can be 
optimally approximated on a pReEUIOL grid as illus- 
trated in figure ¢,. 


seed 


Figure 1. Uniform Grid 


The advantage of this "adaptive" grid is that its granu- 
larity is fine only where needed and the overall number 
of node points is greatly reduced. 


The savings generated by these techniques extend 
to parallel computation only if the architecture is rich 
enough to permit a programming of the algorithm so 
that the irregular processor communications do not 


-add to theoretical complexity of the method. There 


BE 


Pee 


Figure 2. Non-uniform Grid Adaptive Solution. 
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are two approaches to solving this problem. For rela- 
tively small numbers of processing elements, one 
attractive solution is to use data flow machines where 
regularity of structure is of smaller significance than 
volume of computation. In the case of sparse matrix 
problems recent work includes the experiments by 
Lord, Kowalik, and Kumar [6] with the HEP architec- 
ture. For the P.D.E. problem described above Rhein- 
boldt and Zave [11] have shown that the adaptive 


approach can be decomposed at the process levelina 
manner suitable for limited processor data flow com- 
putation. 


For more structured, highly parallel computation, 

the solution is to endow a regular SIMD or MIMD pro- 
cessor array with a connection network capable of 
accomodating the irregular data requests generated 
by these adaptive algorithms. Indeed, most designs 
for architectures devoted to solving partial differential 
equations, such as the Flow Model Processor (FMP) 
proposed by Burroughs for NASA, are large multipro- 
cessors equipped with a highly structured interconnec- 
tion switch. The natural question is if adaptive compu- 
tation can be shown to produce a real parallel speed- 
up, then which interconnection method provides the 
most efficient implementation? 


In the following paragraphs we illustrate that the 
problem of computing the solution to the partial 
differential equation on the adaptive grid can be 
"mapped" naturally onto a SIMD architecture consist- 
ing of an array of p*® processors and p* memory 
modules interconnected by the well known 2 switch (a 
key component of the Burroughs FMP). Furthermore, 
when applicable, the method can run as fast as 


O(k*log (p)) 
including inter processor communication where k is 
the number of levels of grid "refinement". In the 
optimal setting a uniform grid of size n by n 
corresponds to a value of k = O(lag(n/p) which, for 
the sake of comparison, yields an asymptotic estimate 
of 


O(log*( log (p)) 


2. Grid Relaxations and the Mapping Problem. 


The mapping problem can be formally defined as 
follows. Let 2, i=1,,N be the set of nodes at which we 
seek solution values u; to the PDE given the data f;. 
Let CG represent the the directed data flow graph of 
the algorithm. That is, the nodes of CG correspond to 
binary operations and the edges represent the flow of 
data between computations. A program of the algo- 
rithm for a P-processor parallel machine M is a 
decomposition of CG into disjoint sets of nodes 
(CG,,7=1..T} each of size |CG|<P such that if 
(z,y)€CG is an edge and z<€CG, and yeCG; theni <j. 
If we assume the machine is equipped with an inter- 
connection switch capable of some set parallel data 
transfers between processors, then the mapping prob- 
lem is that of choosing a set of processor assignments 
(f::CG, > M,i=1..T} that minimize the the number of 


switch settings required between the 7 computation 
stages. 
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Frequently the algorithm dictates the appropriate 
switch to give an optimal result to the mapping prob- 
lem. For example, Grosch [4] has observed that a 
large class of P.D.E. techniques can be mapped onto a 
p by p array of processors interconnected by the fol- 
lowing three level network. At the first level let each 
processor be connected to its nearest neighbor in a 
square mesh lattice. The second level connects each 
column of p processors with a shuffle connection, and 
the third level connects each row with a shuffle. 
Because separable partial differential equations can be 
easily solved by combinations of Fourier transforms 
and odd-even reductions aligned along rows and along 
columns,this Perfect Shuffle, Nearest Neighbor (PSNN) 
network is well suited for most uniform grid algorithms 
like the Sameh-Kuck poisson solver described above. 


By analyzing the data-flow of the basic com- 
ponents of an algorithm for self adaptive P.D.E. compu- 
tation we shall see how to extend the PSNN network to 
an more complex interconnection switch appropriate 
for the irregular adaptive grids. The resulting network 
is then shown to be equivalent to the 0 switch of Lawrie 
[5]). 

The algorithm studied here is a parallel version of 
an adaptive Multi-grid method designed by Van Rosen- 
dale [10]. The method is based, in part, on the work of 
Brandt [1] who has studied both parallel and serial 
implementations of the Multi-Grid idea. The guiding 
principle of these algorithms is to use a sequence of 
grids, each finer than its predecessor, to accelerate 
the convergence of more standard iterative "relaxa- 
tion’ schemes. In the adaptive algorithm, the 
sequence of grid structures is defined by constructing 
a nested sequence of subdomains of the problem 
domain D. 


DED.SD.SD. 5D; 


A uniform subgrid G is placed over each domain D; so 

that G refines G,_, by quartering certain rectangles. 

Figure 3 illustrates a simple three level refinement. 

From the sequence of uniform grids {G,i=0,4} the 

algorithm works on the sequence of composite grids 
t 


G=tUCG for i=0.k to obtain succesively better 
k=0 
approximations to the solution starting from an exact 
solution on the coarsest grid Gy. To simplify the 
description of the algorithm and its programming we 
shall assume we have P = p* processors with p a power 
of 2 and that each subgrid G, is a rectangle of dimen- 
sion k by Z with both é& andl beirig <p. (A more gen- 
eral algorithm is derivable without modifying the 


connection network constructed below, but the added 
detail provides little illumination of the basic result.) 


The algorithm (described completely in the next 
section) is built on 3 basic operations, injection, pro- 
jection, relaxation, which provide both the setting and 
a solution to the mapping problem. 

2.1 Injection. 


Given a piecewise linear solution w™) on grid G 
there exists a natural interpolate u**" on grid Gay. | 
The injection operation is this interpolation process _ 
denoted by 


au ery) 


.—~ 
oe 
. 


Injec (u™), Ge, Ge+r). 


The computation to be carried out is simple. The 
values of wu) define u®*) at the nodes of all subgrids 
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CC 
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Figure 3. Level Decomposition of Adaptive Refinement. 


of Ge; except the last fine grid G4, At the nodes of 
Ge+1 that correspond to nodes of the coarser subgrid 
G,, the value of u*) is well defined by u(), The 
remaining nodes of G**) are created by the quarter- 
section of rectangles of G* and therefore the value of 
w+) ig determined by the average of the values at 
the corners of the quartersected square. The logical 
choice for mapping the grid structure is to assume for 
the moment that the p® processors are configured asa 
Square array interconnected by a. square mesh net- 
work. In this way each subgrid can be mapped into the 


processor array, and the various stages of the data 
flow graph can be viewed as processes interacting 
between the various subgrids. In this setting, the 
interpoiation operator can be seen as taking vaiues 
from subgrid G, to values in G4,. The interconnection 
switch should map the nodes of G4, that lie in the 
embedding of G to the embedding of G,;. This is 
most readily understood by considering a one-column 
or one-row slice. In figure 4 the one dimensional view 
shows G, along the bottom 


Figure 4, Subgrid Expansion. 


row of processors. The same set of processors are 
shown along the top row but this time representing the 
embedding of G4; The lines connecting the two 
represent the identification of equivalent nodes in 
Ge+i Call the process of making this identification 
"subgrid expansion”. In this one dimensional case, a 
connection network that provides the required map- 
ping is a bidirectional omega switch on p processors 


Q,. This switch is constructed from loga(p) shuffle- 
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exchange operations. Figure 5 illustrates its perfor- 
mance on the case shown in Figure 4. 


LEMMA 2.1. Let x,,.., 2, denote the sequence of p pro- 
cessors and let 2, ...2%,+:, '=p/2 denote the indices 
of a subgrid, then the 0, network can map the subse- 
quence to any even or odd subsequence of 7),..,Z,. 


PROOF, This result follows from the equivalence of 
QO, and the Batcher bitonic merge network (see for 
example the thesis of Parker [8]). To use this 
equivalence, one need only construct the appropriate 
bitonic sequence that, when merged, maps the subse- 
quence to the appropriate place. Suppose z,, must 
map to z,. It follows that 2,4, > Yyigq, fori <¢. Define 
the remainder of the bitonic sequence by mapping the 
remaining processors Zp+s moa(p)) for s=ftl,.p-1 
according to 


Z(e+s mod(p)) > Zy+es for yt+es =p 
and 
Zlk+s mod(p)) ” “2p-1-y-2s for y+es > p. 
The effect is to extend the increasing subsequence toa 


permutation consisting of one increasing set of indices 
and one decreasing set. 


Vee Ne: 
Setanta 


Figure 5. Subgrid Expansion via Q, Shuffle-Exchange. 


To carry out the complete two dimensional 
subgrid expansion observe that it suffices to “expand” 
first along rows and then along columns. Furthermore, 
the averaging computation can be carried out by data 
transmissions along the square nearest neighbor net- 
work assumed to underlay the embedding of G,,. In 
other words, each node in G4, not lying on G is either 
the bisector of an edge or the center of a square in Gy. 
By first transmitting values along edges from those 
nodes in G(\Ge+; one can compute values along 
bisected edges. As illustrated in figure 6, a second 
broadcast pass is sufficient to propagate values to 
determine w*t!) at those node in G4, forming the 
centers of squares in G. Summarizing we have 


Proposition 2.1. Let 20, be the network composed of 
p Q, switch networks aligned along rows and con- 
nected to an identical set of 0, networks along the 
columns of ap by p processor array. If we assume 
data transmission through an () network can be done 
in one major clock cycle the time for the subgrid 
expansion is two cycles. With the addition of a 
nearest neighbor connection capable of broadcasts 
this combined 20,-NN network permits the injec 
operator to be completed in a time of 4 clock cycles. 
(A more strict interpretation of the Q, switch as a 
log(p) cyele device and the NN network capable of 
only parallel horizontal and parallel vertical data 
transmissions yields a time bound of 2log (p)+8). 


«2 Projection. 


Given a function f ®t) corresponding to the data 
for an elliptic P.D.E. problem on Get the projection 
operator determines a function f® corresponding to 
a reduced problem on G. The operator, denoted by 


f = proj (f Ce +1) Ge +11 Gi) 


has been interpreted by various authors to mean vari- 
ous things, but Nicolaides [7] has shown that the 


Figure 6, Local Data Flow for Jnjec. 
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correct interpretation is that of the transpose of the 
injection operator. If we permit 0, to be bidirectional, 
the projection operator requires the same data flow 
patterns as does injection only reversed. 


2.3 Relaxation. 


At each node 2, of a grid G the differential equa- 

tion can be replaced by a difference equation 

oe AsjUy = fs 

jES, 
where u,; is the approximate solution value at z;, S, is 
the set of nodes of G serving as vertices of squares 
containing z, as a vertex, and a,; are the coefficients 
of the finite difference approximation (the exact form 
of which will be of no concern here). The classic 
parallel relaxation step starts with an approximate 
solution u- to the difference equations above an com- 


putes an improved solution u's by a formula of the 
form 7 


' a 
Us = Us + ; fs- dy O37 U; ) 
Ogg jes 3 


for some constant A known as the relaxation factor. 
The parallel complexity of this computation is depen- 
dant on the structure ‘of the set S,. For the non- 
uniform grids described above S, can take two forms 
as shown in figure 7. 


Oe ete tie et a oR oo 
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Figure 7. S; Structures. 


In the simplest form z, is not on the boundary of two 
subgrids. In this case the nearest neighbor network is 
adequate to provide all data transmission. In the other 
case the node is on the boundary of subgrid G and the 
computation should be split with a partial result 
passed between G and G-, via the 22, network. More 


- formally, let 


Sse = (2; € S, such that mes 
and let infG, denote the nodes in the interior of G and 


ext, = Gy, — mtys; 


The relaxation procedure becomes 


Proc Relax(u', u, G ): 
begin 
For 1 := k downto 0 do 
begin : 
For gach node z,€eztG, pardo 


f A 
Us := Us + oss > > Os; U;); 
. ‘Ss JES, 
For each z,¢G, (\G-, pardo 


U's — proj (u' ’ Gi, G1); 
end; 
end; 


The exact amount of computation required 
depends on the form of the finite difference operator. 
Using the definition of S given above and the strict 
interpretation of 0 network timing the bound is 
approximately (k +1)(log (p)+16) 


24 Network Equivalence. 


By analyzing the data flow of the basic grid opera- 
tors we have constructed a network 20,-NN which 
can be viewed as an extension of the PS-NN connection 
introduced earlier. We now show 


Proposition 2.2. By numbering the P = p* processors 
by columns, the networks 20, and 0p are equivalent. 
Furthermore, The Nearest Neighbor connections (NN) 
can each be routed in one pass of the Op network. In 
terms of the time required to route an item of data 
through the network and complete one multiply and 
add operation ¥# processors interconnected by a 
bidirectional (1), have the upper bounds 

1. Injec(.,.,.) in 9 Qp-compute stages. - 

2. Projt(.,.,.) in 9 Dp-compute stages. 

_ 3. Relax (u',w,G*) in 17k Qp-compute stages. 


Proof. Number the processor in row7 and column 
Jj 8S Xtp(j-1). To prove the stated network equivalence 
we again exploit the equivalence of the Batcher Merge 
network B(P) to Qp. The former can be described as 
log (P) stages with the k™ stage defined by the proces- 
sor connections 


B,(P): 2, <---> 2 k=1,,.,.P/2 


L+P/ 2k: 
used to execute P/2 compare-exchange steps in 
parallel. Observe that for [=i+p(j-1) we have 2, is 
mapped to the processor. with index 

: 2 


| re FF 
L + oe i+p(j—-1) + oe 
si+p(j-it 2) 


which for k <= log(p) represents the index of B,(p) in 
row, Fork > log (p), we have 


ak —log (p) a p(j-1) 

which represents By -iog@)(p) in column 7. Hence the 
log(P) stages of B(P) can be decomposed into the 
log (p) stages of B(p) organized along rows followed by 
the same number of stages of B(p) organized along 


Pe os 
see a 
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columns. 


In other words, | 


To complete the mapping of 20,-NN to Op observe 
that the nearest neighbor network connection in 
column order has “horizontal shift" equivalent to "shift 
by p or —p; and "vertical shift" is the same as a shift 
by 1 or -1. But each of the latter uniform shifts are 
well known to. be executable with the N connection (see 
(5]). The set of upper bounds follow from the bounds 
derived earlier by replacing 2log(p) and each unit 
time VN data move with one Qp transmission. 


3. MULTI-GRID ANALYSIS. 
A parallel version of the "locally refined” Multi- 


Grid algorithm of van Rosendale is given by an iterative 
application of the recursive procedure below. 


Proc MG(u, f, k); 


begin : 
1 fori := 1toft do 
begin oe 
Relax(u',u,G.);  uw:=w; 
end; 
if k > 0 then 
begin 
2 fori := 1to N, pardo 
fis=f - > QijU; ; 
jes, 
3 ff ss proj(f! Ge, Ge-1)i 
4 MG(u',f®-),k-1); 
5 u:zu t+ Injec(u' ,Gy-1, G&): 
6 fori :=i+itoM do 
begin _ 
Relax(u',u, G@)i uws=w; 
end; 
7 end else solve exactly on Go 
end; | 
begin. 
(* main *) _ 
8 wu := exact solution on Go; 
k:= index of finest grid; 
9 fori := 1 tok do 
begin 
u := Injec(u, G,, Gisi): 
MG(u, f,7); | 
end; 
end. 


The number JN, is the total number of nodes in G, 
and the loop bounds ¢ and M in lines 1 and 6 are con- 
stants depending only on the partial differential equa- 
tion in a manner discussed below. By starting with an 
exact solution of the finite difference equations on grid 
Gg (obtained for example, by the Sameh-Chen-Kuck 
fast Poisson solver) the method produces a sequence 
of approximate solutions for each grid G for l=1.4, 
The running time for this algorithm is derived as fol- 
lows. Let 7, be the time for one call to MG(,,.,4) in 
terms of network communication-computation steps, 


where in this section we assume one pass through the 
switch is log(P) steps and the computation step as 
unity. From our previous analysis we obtain the 
recurrence relation 


Ty = t((k+1)(log (P)+16)) + S + (log (P)+8) 
+ Ty, + (log (P)+8) + (M—t)((k +1)(log (P)+16)) 


where the summands on the right come from lines 1 
through 6 of the program and the residual computa- 


tion (time S) is similar to Jnjec. Solving the 
recurrence gives 
_ Mk? 
T, = “= 20g (P) + O(klog (P)) 


The main block calls MG(*) as indicated above and 
resulting in a total time of O( Mk %lag (P)) 


In order to arrive at an upper bound on the time 
required to compute a "final" solution, one must ask 
when an approximate solution to a set of finite 
difference equations that approximate a PDE is an ade- 
quate approximation to a true solution to the PDE. For 


a given grid structure G let s, denote the difference 
(in the mean square sense) between the true solution 
of the partial differential equation and the exact solu- 
tion to the differential equation. Simply stated, the 
main result of the numerical analysis [10] is that there 
exists a constant C independent of G such that if M > 
C then the difference between the approximate solu- 
tion produced by the MG algorithm and the true solu- 
tion of the PDE is less than ge. To compare two 
methods we must compare the computation time to 
produce solutions of comparable accuracy. While the 
comparison of this version of parallel Multi-Grid to the 
fast Poisson solver is the currently the subject of 
several numerical experiments that will be reported on 
later, it is possible to give a rudimentary analysis of 
expected performance, 


Under optimal conditions on the initial data f and 
the PDE being solved, the truncation error ¢&, is for a 
non-uniform grid G is comparable to an by n uni- 
form grid when k =c(log(n/p)) for a small constant 
c <2. In such favorable circumstances we expect per- 
formance of 


O(k®log(P)) = O(log*(n/ p )log (P)) 


which compares well with the bound = of 
O((n/p)*log(P)) for a fast Poisson solver on the uni- 
form grid. 
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A PRACTICAL ALGORITHM FOR THE SOLUTION OF LOWER TRIANGULAR SYSTEMS ON A PARALLEL PROCESSING SYSTEM 


Robert K. Montoye and Duncan H. Lawrie) 
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Urbana, Illinois 61801 


ABSTRACT 

An algorithm is presented for a more efficient 
and implementable solution of lower triangular sys- 
tems on a parallel (SIMD) computer. Additionally, 
this algorithm has been mapped to a hypothetical 
machine with as many memory units as processors, an 
Q alignment network, and a control unit that can 
generate P-ordered memory addresses. Assuming that 
L is a unit lower triangular system of order N, the 
system can be solved in T arithmetic operations: 


using P = processors, 
if r < > T = OW ene *) 
| 2 l-r/3 2/3 
ifs-<r<¢ 3, T O(N xlog (N)). 


The data is directly accessible in the evalua- 
tion step and can be moved to a location where all 
required data for the inversion step can be 
accessed. The memory/processor connections are 2& 
passable and the processor/memory connections are 


g} passable. Preliminary error results of a FOR- 
TRAN simulation indicate correlation between this 
and the serial algorithm for both stable and 
unstable problems. 


1. Introduction 


This paper shall discuss the limited processor 

solution of unit lower triangular systems: 
(Lxx = £ with L of order N). 

Time (T) is measured in terms of the number of 
operations that can be performed using up to P pro- 
cessors performing a single operation on different 
data and is proportional to the time required for a 
system to execute the algorithm considering both 
access and alignment penalties. 


In previous papers on this topic: 
Chen & Kuck’s [ChKu74] "product form" proved: 
L x x = f could be computed with 
P = O(N?) T = O(log *(N)). (1) 
Hyfial & Kung [HyKu74] used this with problem par- 
titioning to show that using P = N’ processors, 


with 3 < r < 3 T = o(nt 2M x(og()) 
with r < 2 = O(n? * xLog(N)). (2) 
Also of interest is the ee of "column sweep" 


or direct forward substitution method [Kuck76]: 
P = O(N) = O(N) (3) 


This research was supported in part by: 
National Science Foundation grant MCS-81-00512 
and Dept. Energy grant US DOE DE-ACO2-81ER10822 
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To produce a limited processor algorithm, the 
system is partitioned into s=N/w blocks of width w: 


XL, .Xx, = f. I1<i<s 
j=1 1,J a | —1 
i-l 
This has the form L. , x x,- ZT L, .xx, 1 <i<€<s. 
~ d,yi =i j=l i,j -j 
as eng are w element vectors and L. is wXw. 
> 
IL, 1 | Ix, | lf 
F 
aber | fel 
ial a age : . ee - ea! 
Pet oe Le eoee Pie | Ix, | If. | 
Figure i - Hyfial & Rune’ Ss Partitioning 
The partitioned system can be solved by seri- 
ally producing each new i. = ee j x xX, ; then 


xX, using w= wt. pi/3 


[ChKu74] producing (2). 


Considering that the product form is performed 


by explicit cereulgt ten of x, = L, VE» inversion 
’ 


(which adds an O(log” (N)) delay) can be performed 
for all L, “4 in ace increasing the utiliza- 


tion by moving the OGes (N)) term outside the loop 
on se. The evaluation phase is then a matrix multi- 
plication, producing an O(log(N)) delay. 


Solve for all the L ie 
se 
DO i= l,s -] 


x,=L, .*£. 3 
=i i,i -i 


DO j = itl,s (* in parallel *) 


Ee ay 
END ; j J, 


solving the recurrence for x 


and the "product form" 


in parallel ; 


END; 
a) The s Py 


> 
their constituent elementary matrices in: 


can be produced by products of 


1< (5 ) + O(log” (w)). 


b) The resulting system can be solved in: 
T < 2xsxlog(w) + o(n?*), 


c) Choosing w= (nT /3 x108 21)? produces: 
if r <7, T = ON 
a: 2 1 =? 2/3 (4) 
if 3< r < 3, T = O(N xlog (Ny) 
which compares favorably with (2) above. 


Additionally, its limiting cases are: 
N 
the results of (1) at P = (ioe ary? and 


O(N). 


the results of (3) at P 


2. Storage Scheme 
The data for the problem is modeled as a two 
dimensional array stored across the P memories in 
column major order. Thus elements within a column 
are in consecutive memories and elements in adja- 
cent rows are in memories that differ by the column 
length. The processor assignment will be described 
in this same light. Assuming A is dimensioned 
A(N,N), any two elements A(R_,C_) and A(R, Cc ), are 
; , . a b 
both accesible if and only if: 
((C_-C, )x N + (R_-R_)) mod P # O. 
A sufficient*condition is’ that: 
}(C_-C, )x N + (R_-R,_)] < P 
as X and X + 6 erat be differen mod P if |6| < P. 
Any set of elements whose linearized distance 
between all pairs is less than P is accessble. 


3. Alignment Network 


The networks used for data alignment (Q for 
input and 2 for output) have been extensively 
studied in [Lawr75], [Wen 77], [Yew 81]. The 
results needed in terms of the source-destination 
pairs that will pass a given network of P input and 
output ports will be listed here. 


1) A mapping will pass an Q network if for all 
1.4% (s, -~ s.) mod P < (d, - d.) mod P. 

2) A mapping will pass an Q. 
L543 (s, -~ s.) mod P > 


network if for all 
Cd ~ d.) mod P. 
3) Both networks are partitionable. 
If the source and destination locations are 
partitioned into blocks of size 2~, the map- 
ping is passable if both the mapping within 
each partition is passable and the mapping of 
the partitions in the system is passable. 


4. Matrix Multiplication 
The matrix-matrix multiplication operator per- 
forms all the arithmetic for the algorithm. This 
operator can be scheduled to maximize efficiency 
and minimize memory and alignment conflicts. The 
matrix multiplication of A (a by 8) x B (B by y) 
producing C (a by y) requires the summation of a 
dot product of length ff for each element of C. If 
a, 6B and y are all powers of two and B<q, the 
operation can be aligned using the 2 and 2 net~ 

work. Further details are in [Mont8l]. 


3. Parallel Inverse Calculation 


Inversion is accomplished using the identity 
from [Hous64] that the inverse of a unit lower tri- 
angular system is expressable as a product of the 
inverses of its constituent elementary matrices. 


M. =M. . = I- (L-I) xe; 
+ i Li i 


i; = M xM xM XeeeXeeeXM, XM. XM 

n-l n-2 “n-3 3 2 0°51 
There are s such systems of order w to be 
inverted. Each system will be fanned in with a 
binary tree similar to the technique used by Sameh 
& Brent [SaBr77] in which each level (2) of the 


tree doubles the number of elementary matrices in 
one product (2%) and halves the number of products 


(piosw)-2) being formed. Each elementary matrix 
product involves the multiplication of a matrix of 


X-1 X-1 


size w by 2 by a square matrix of size 2 and 


the addition of a (w by gly matrix to the result. 
To efficiently implement this operation, the 
matrices to be inverted are extracted to a (w by N) 


array. This allows the data for the inversion to 
be accessed in at most two passes. 


Step 2 should take 
logw-2 Dae 
t, g 2fisx@ =) Cust a re ies"): 
The time to solve for all inverses is 
log(w) 


3 
sXw 2 
T < > Ty < xP + O(log’ (w) ) 


6. Sol¥ing - the Partitioned System 


After the inverses have been computed, the 
next w elements of the x-vector can be solved and 
the rest of the f-vector can be updated in two 
matrix multiplications: 


mi tee 


is a matrix (w by ) vector (w by 1) product. 


Assuming that P ns T = 2 + log(w). 
The most processor Satensive step of: 
f. = f.-L, ,xx, 
ce —J Js,i tl 
is a matrix (N by w) vector (w by 1) product: 
This should take: T < 2p] + log(w). 
The total over all s, 
T < 2s x (2 + log(w)) + 2xN 
Since both the inverted submatrices and the 
matrix for updating are stored in column major 


2-r 


order in partitions of their own dimension, they 
are accessible and alignable. 
7» Total Solution Time 
By choosing w = pl/3, tog! /3 (Nn), the unit 


lower triangular system with L of order N can be 
solved using P = processors in time: 

if 37 Ce T = o(N-*) 

if =< r < 3, T = (nt t/3 x1 097/3y) 

using the algorithm previously discussed. The fol- 
lowing graph compares Hyfial & Kung’s algorithn, 
product form with folding [Kuck76], and the current 
algorithm for N = 500 with processing time = 2, 


alignment time = 1, and memory time = l. 
6000. 
[Hyku74]>} Folded 
Sooo. \ Product 
\ Form 


Current 


4000. Algorithm 


5800. 


Men 


2000. 
1000. 


2. 
129 191 197 195 194 195 196 197 
PROCESSORS 


Figure 2 — Comparison between various algorithms. 
§- Numerical Experiments 


A program that simulated the arithmetic 
involved in this algorithm was written in FORTRAN. 
The array indices on the most parallel steps were 
linear combinations of 5 do-loop indices in paral- 
lel (indicating that a control unit that could gen- 
erate P-ordered vectors of depth 5 could produce 
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the results shown here). The results are obtained 
by comparing double precision serial solutions with 
the results derived with this algorithm. The par- 
tition width was reduced to w=l1 at P=N to force 
"column sweep" [Kuck76], known to be computation- 
ally equivalent to the serial algorithm, to allow 
comparison with serial algorithms. 


The graphs that follow are generated in the 
following manner. Using a specific lower triangu- 
lar matrix, an x-vector is generated using a uni- 
form distribution (+1). The matrix. is multiplied 
by this vector to produce an f vector. The differ- 
ence between the solution generated using the pre- 
vious algorithm and double precision column sweep 
is the error. This value is plotted for N = 64, 
comparing single precision column sweep (using w = 
1, P = 64), inverting matrices of size 8 with P = 


256, inverting matrices of size 16 with P = 1024, 
and inverting matrices of size 64 with P = 16384. 
The number of digits of accuracy is’ then 


~log;o(relative error). The number of digits of 
accuracy is then averaged for each element for 25 
cases of random x-values with the same matrix. The 
first matrix represents the 3-term recurrence of 
the Chebyshev polynomial T(x) for x=2. 


1s ae = f. + 4x at Pe | = ae ny de 
Dp ee —— P=64,w=l 
a ---- P=256,w=8 
1 10 +H+ P=1024,w=16 
4 8 P=16384,w=64 
5 «6S 
F 4 
: 2 
: 2 
1-2 
a ~4 20 42 6D 
2 | 
e 12 3o So 72 
COMPONENT 


Figure 3 - Comparison of errors for unstable matrix. 


The second example is the solution to: 


a | = f. — l. XX sy + 0. XX sos 


It is a well-conditioned problem as its character- 
istic roots lie within the unit circle. 


14 
7 
g le 
rT if 
S 
6 8 
F S ——— P=64,w=1 
Pp ---- P=256,w=8 
2s 4 ++H+ P=1024,w=16 
C 2 ——— P=16384,w=64 
7B . 
iN "4 22 42 60 
12 52 52 72 
COMPONENT 


Figure 4 - Comparison of errors for a stable matrix. 
9. Conclusion 

The method of problem partitioning (first 

appearing in [HyKu74]) has substantial potential 

when the additional parallelism that such a method 

allows is exploited. The parallel inverse calcula- 
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tion places the O(log2(N)) delay associated with 
matrix inversion outside of the seriality of solv- 
ing the partitioned system. ane allows: 

a) The dominance of the O(n? Py term for PLN 


b) ©The selection of wider partitions 
1/3 2/3 


otf 


(N) and as a result: 
a o(n tt /3 xr0p7/3 qn). 

This Bien should allow more practical 
solution of larger general unit lower triangular 
systems with limited processors. However, of 
greater concern is the fact that the alorithm can 
be implemented on a parallel processing system with 
as many memories as processors and an efficient 
alignment network connecting them. Additionally, 
the control requires only generation of P-ordered 
vectors for memory accesses and lockstep (SIMD) — 
processing. A data storage scheme that allows the 
algorithm to be executed and data accessed in the 
same order of time as the theoretical result, by 
using a scratch data area to store the matrices to 
be inverted, has been shown. Finally, a small set 
of experimental error results have been shown to be 
close to the serial results. In conclusion, a 
practical, limited processor algorithm for the 
solution of unit lower Entanguler matrices has been 
demonstrated. 


w= Po — 
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A PIPELINED DIGITAL ARCHITECTURE FOR COMPUTING 
A MULTI-DIMENSIONAL CONVOLUTION* 


K. Y. Liu 
Jet Propulsion Laboratory 
California Institute of Technology 


Pasadena, CA. 


Summar y 


Two-dimensional (2-D) cyclic convolutions 
have found many applications such as image pro- 
cessing [1] and synthetic aperture radar (SAR) 
processing [2], etc. The major problem when one 
uses the conventional FFT technique to compute the 
2-D convolutions is that complicated matrix trans- 
pose operation must be performed. To alleviate 
this problem, several authors [3], [4] have sug- 
gested that efficient algorithms using polynomial 
transforms can be used to compute a 2-D convolu- 
tion. Recently Reed et al. [5] extended the 
results given in [3], [4] and developed an effi- 
cient algorithm using the radix-2 fast polynomial 
transform (FPT), the fast Fourier transform (FFT), 
and the Chinese Remainder Theorem (CRT) to compute 
a 2-D cyclic convolution. This FPT-FFT-CRT algo- 
rithm requires fewer multiplications and about the 
Same number of additions as the conventional FFT 
approach for computing a 2-D convolution. In [6], 
the author and Reed et al. proposed a parallel, 
pipeline architecture to implement this new algo- 
rithm for real-time SAR processing application. 


In this paper, the work in [5], [6] is further 
extended to derive a pipelined digital architecture 
composed of modular FPT, FFT, and CRT computational 
units for efficiently computing a 2-D convolution. 
The extension of this machine concept to effi- 
ciently compute a multi-dimensional cyclic convolu- 
tion is also presented in this paper. 


Let at, it and bey, be two dj x dg arrays, 


Then the 2-D 
t1,t2 can be 


expressed as a one-dimensional polynomial convo- 
lution [3] 


2 
where 0 < ty <d; - 1 for i= 1,2. 


cyclic convolution of ati ,t9 and b 


d,-1 


| d 
e | 2 
a -)) Pore Bent) mod € - : (1) 


t,=0 


for 0 < nj < dj - 1, where (ny - tj) denotes the 
residue of (n, - t,) modulo d.- 


*This paper presents one phase of research 
conducted at the Jet Propulsion Laboratory, 
California Institute of Technology under Con- 
tract No. NAS/7-100 sponsored by the National 
Aeronautics and Space Administration. 
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In (1) the polynomial C,,(t), At, (Z) and 
1 1 
B(ny-t1) (4) are of the form 


d,-1 . 
A. (Z) = ) a Zz? (2) 
t are & 
a a, ee 
2 
m 7 
If dp = 2, then one can express Z - 1 as 


a product of (z92/2 +1) ana(z?2/2 


=a Lyng 

Since these two factors are relatively prime, by 
the Chinese Remainder Theorem (CRT) for polynomial 
[7], the polynomial congruences 


nas c,, (2) mod (2°22 + :) (3a) 
and 
Cas c,, (2) mod ("2/2 = :) (3b) 
have a unique solution 
c, (2) = oz) (- 7) Gazz 2 , 
+ oe) (3) (,2/2 + 1} 
mod (.*2 “ 1) (4) 


Thus we have decomposed a d9-point 1-D polynomial 
convolution into two d9/2-point 1-D polynomial 
convolutions. Note that in (4) the arithmetic 
required to compute Cy; (Z) from C1 (1) (Z) and 
C9(1)(Z) requires only cyclic shifts and additions. 


Applying the same technique to the factor 


d2/2 
(Z ~ 1) yields the following congruences 
d 
GY @y Bes @yemed (, ae’ 3 (5a) 
(2) @ “2/4 
C55 (Z) = C, (Z) mod \ Z - 1 (5b) 


which can be solved by an equation similar to (4). 


If one uses the transformation Z = w u, given in 
[4], where wy is a d2/2th root of -1, on C,) (2), 
then (3a) can be expressed as 


d 

G(s) = ey? (wy) = 6, (wyu,) mod (s,2?? =) 
Thus Cn Gi) can be computed similar to the case 
given for Co (1) (2). Cz ¢ ) (2) can be obtained by 
the inverse transformation uj = wz on ¢, 0 (uz). 
Thus we have decomposed a d9-point 1-D polynomial 
convolution into four d2/4-point 1-D polynomial 
convolutions. 


If one repeats the above procedures, then one 
can decompose a d7-point 1-D polynomial convolution 
into 21, d2 /2i-point 1-D polynomial convolutions, 
where i is the level of decomposition. Thus in the 
computation of a 2-D convolution, the input poly- 
nomial Ay,(Z) is decomposed into 2+ polynomials 
Ay, (1 (Z) by moduloing the appropriate polynomials. 
Each of these polynomials is then convolved with the 
corresponding polymial Bin #4) @) obtained like- 
wise from B(nj-t1) (4)- The results of these poly- 
nomial convolutions are then combined using the 
Chinese Remainder Theorem to form the final result 
Cy, (2). Since the above technique uses 1-D poly- 
nomial convolutions of identical size, modular 
polynomial convolution and Chinese Remainder 
Theorem computational circuits can be used as basic 
building blocks to implement a 2-D convolution sys- 
tem. Moreover, since the computation of these 1-D 
polynomial convolutions are independent, these con- 
volutions can be done in parallel. 


Theoretically, one can decompose a long 2-D 
convolution into many small and identical poly- 
nomial convolutions. However, fast algorithms may 
not exist when computing a small polynomial convo- 
lution of arbitrary size. It was shown in [4] - [6] 
that when dg = 2™ and dy = 2m-rtl for some r, l<r 
<m, a fast polynomial transform can be used to 
compute the 1-D polynomial convolutions. Thus when 
the decomposition level is equal to k, where 1<k 
< r, one can use the fast algorithm presented 
above involving FPT and FFT to compute the 2k, 
d2/, k-point polynomial convolutions. Of course, 
when k = r, one can use 1-D polynomial convolutions 
of the smallest size to compute a d, x dyg-point 2-D 
convolution. 


As an example the computational flow diagram 
of a dj x dg-point 2-D convolution, where dj = 
gm-r+1 and dg = 2M with r = 2, is shown in Fig. 1. 
Note that the maximum possible decomposition level 
r = 2 is used. Hence this 2-D convolution is 
decomposed into 4, dj /4-point polynomial convolu- 
tions, where each of the polynomial convolutions 
is computed using the fast algorithm discussed 
above. A pipelined architecture to implement this 
example is shown in Fig. 2. A detailed description 
of this architecture is given as follows. 
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In Fig. 2 the input data is coming in serial 
word-by-word along the d2 direction, i.e., consecu- 
tive d2 words are considered as one line along the 
d2 dimension in a djxdj array. The input is con- 
trolled by a switch. During the first half of the 
d2 points, the switch is in position 1. During the 
second half of the dz points, the switch is switched 
to position 2. Thus the second half of the d2-point 
data is added and subtracted with the first half of 
the dg-point data to perform the polynomial modulo 


(2792/2 _ 1) and (7292/2 4.1) operations required by 
the first level of.the convolution decomposition. 
The same technique is applied to the two branches 
of the second level of the convolution decomposi- 
tion except now a delay of do/4 is needed to per- 


form the modulo (2922/4 _ 4) and (292/444) opera- 
tions. Also at proper branch of the second level 
of the convolution decomposition, multiplication 
by wy ys where £1; = 1,2, ..., d9/2, is performed 
on the input data to perform the transformation 
Z= w1xul. The output of the second level decom- 
position is fed into the 1-D polynomial convolu- 
tion which consists of a pipelined FPT [6], a pipe- 
lined FFT [1], a multiplier, an inverse FFT, and 
a pipelined inverse FPT. The constant filter 
coefficients Br, 6K) (2) is read out from a table 


and multiplied with the FFT outputs. 


The Chinese Remainder Theorem (CRT) computa- 
tional units shown in Fig. 2 to compute an equation 
of the form given in (4) can easily be implemented 
by delay lines and adders. From Fig. 2 one can 
see that a FFT butterfly type of circuit [1] and 
serial memories can be used as the basic building 
blocks to implement the system. With the advent 
of VLSI technology, such building blocks can easily 
be implemented on VLSI chips. 


The about technique and architecture for com- 
puting a 2-D convolution can easily be generalized 
to compute a multi-dimensional convolution of 
dimension greater than 2. Let the input data be 
dj x d2 x -- x dp arrays, Then it can be shown 
that a fast algorithm similar to the FPT-FFT-CRT 
algorithm discussed above exists if dj, dos “———, 
d, satisfy the following condition: 


mo Sreer m m,-r,tl 

m ie, Se os eke eae a we 

dy = 2", d= 2 =2°,4 ,=2 
m 
m -r PL. n-1 
mn, dy =2 n-2 n-2 ~ = 2 , d, 
m1 n-1t 
= 2 
where 1 < r, <m, for i= 1,2,...,n-1. 


[1] 


[2] 


[3] 


[4] 


[5] 
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REAL-TIME LISP USING CONTENT ADDRESSABLE MEMORY * 


Jeffrey G. Bonar and Steven P. Levitan: 
Department of Computer and Information Science 
: - University of Massachusetts 
Amherst, Massachusetts 01003 


Abstract -- The dynamic data structures of 
LISP require periodic garbage collection, 
prohibiting the use of most LISP implementations 
for real-time applications. 


for implementing a real-time LISP system which 


uses Content Addressable Memory (CAM) to allow 
incremental garbage collection. In our’ scheme, 
all basic LISP operations, notably including 


retrieving a free cell for CONS, the list building 
function, and retrieving a current name—value 


We propose a scheme. 


binding, can be implemented with four or fewer CAM 


searches and very little other computation. 
Furthermore, CAMs are well suited for sufficiently 
inexpensive implementation with VLSI technology. 
Our system is not suitable if a virtual memory 
environment is needed, and becomes considerably 
more complex with CDR-coding. We are currently 
implementing a version of our’ scheme on a 
microcomputer. : 


Introduction 

There are many real-time tasks which lend 
themselves to Artificial Intelligence (AI) 
solutions. Examples include assembly line robots, 
rapid transit system controllers, many complex 
scheduling tasks, and intelligent assistants for 
interactive devices. Such systems will most 
likely be designed and tested in LISP. The 


flexibility and expressibility of LISP have made 
it the "work—-horse" language of the AI community. 


Can the prototype systems, still written in LISP, 
then be transferred to the final "production 
model"? 


We feel they can, but not with a standard 
LISP implementation. : 


The dynamic data structures of LISP require 
the use of "garbage collection" to reclaim memory 
as the data structures of the program grow and 
shrink. Garbage collection is typically done in a 
two phase process of first tracing and marking all 
active data, and then collecting all unmarked 
data. Depending on the size of the memory this 
Operation can cause serious delays in processing. 
These delays can occur any time the program needs 
a new free cell. In particular they could occur 
during time-critical applications. An alternative 
Space management scheme, reference counting, is 
unacceptable because it allows unbounded delays 
whenever a cell is released to the free list. 
This is because all successors of the released 
cell could become garbage and would have to be put 


* Support for this work was partially provided by. 


the Army Research Office under grant 
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reasons a 


that the time required by each of the 


on the free list at the same time. For these 
standard LISP implementation is not 
considered acceptable for real-time environments. 


In this paper we discuss a real-time LISP 
implementation. Various LISP machines (e.g. 
Greenblatt [7] and Deutsch [4]) although 
usually presented as personal computing tools -- 
have shown that special purpose processors can 
vastly inerease the speed and utility of LISP 
programs. Our paper shows how special purpose 


associative memory can be used to gain additional 


benefits. 


define a real-time 
having "the property 
elementary 
Operations is bounded by a constant independent of 
the number of cells in use", Baker's real-time 
LISP system involves incrementally compactifying 
and. linearizing active cells by moving them 
between two memory partitions while leaving the 
garbage behind. Wadler C11] analyzes and 
summarizes a real-time scheme involving two 
processes running in parallel: the mutator is the 
application program while the collector keeps the 
free-list from becoming empty. 


Following Baker [2] 
list processing system 


we 
as 


Our scheme uses specialized hardware, Content 
Addressable Memory (CAM), to create a very fast 
real-time LISP system, using avery simple set of 
algorithms. This speed and simplicity, which are 
the advantages of our scheme, are due directly to 
our use of CAM to examine all cells in memory in 


parallel. 


After 
its 
discuss 


discussion of CAM. 
presenting our real-time LISP scheme, 
limitations are discussed. Finally, we 
our implementation of this scheme. 


We begin with a 


. Content Addressable Memory 


General Description 


Content Addressable Memory 
organized such that each word can compare its 
contents, rather than its address as in random 
access memory (RAM), with a value broadcast by the. 
central processor [5]. This comparison process is 
done by all CAM words simultaneously. The 
processor can then interrogate the CAM to discover 
which words, if any, match the broadcast value. 


(CAM) is memory 


Each word of a CAM memory has an associated 
responder bit (see figure 1). This single bit is 
reset if the contents of the word do not match the 
broadcast value, held in a register called the 


COMPARAND 
RESPONDER 
BITS 


CONTENT 
AUTRE SSHBLE 


Jd WORDS LONG 


MEMOS 


Sr BITS WIDE ——————~ 


SOME-NONE 
BIT 


Figure 1. CAM Organization 


comparand. All responder bits are typically OR'ed 
together and their disjunction is available to the 
processor as the signal SOME—NONE. Using 
SOME-NONE the processor can determine if there are 
any words that match the comparand. Additionally, 


a function to count the number of responders is 
often provided. 


Another function the responder bits provide 
to the processor is to allow it to select a single 
responder if more than one exists. This is done 
by daisy-chaining the responder bits such that 
when the signal SELECT-FIRST is generated by the 
processor only the first responder in the chain 
remains set and all the others are reset. The 
processor: can also perform the function SET-ALL 
which sets all the responder bits true. This is 
usually done before the comparand is broadcast to 
the memory. 


Along with the comparand the 
broadcasts a mask value. This is used by the 
words of the CAM to determine which bits of the 
word .are to participate. For bits in the word 
where the mask bit is not set, no comparison takes 
place. The full operation is: 


for all Words J 
for all Bits I in Word J 
Responder _bit[J] <- 
Responder _bit[J] 


and | 
~ ( ( Mask bit{T] 
and 
‘CAM bit{I,J] = Comparand bit[TI] 
) 


or not Mask bit[T] 
) 


Note that this operation takes place in all words 
in parallel. . ‘ 


Ld3 


- are not inconceivable [10]. 


processor also. 


The processor can also perform the operations 
READ—-RESPONDERS and WRITE-RESPONDERS. These allow 


the processor to read the contents of and change 
the contents of all words whose responder bits are 


set. This operation is often implemented to be 
under the control of the mask. Finally it is 
often convenient to allow the processor to access 


the CAM as a regular RAM and allow reading and 
writing of single words. 


Suitability For Very Large 
(VLST) 


CAM is well suited to VLSI implementation. 
Foster [6] and Mead and Conway [10] both discuss 
the practical design of a VLSI CAM circuit. Two 
~f the most important criteria for determining if 
a circuit can be implemented efficiently in VLSI 
are the regularity of circuit components and the 
number of input/output pins necessary [10]. CAM, 
like RAM, has an inherently regular sub-structure: 
the word. 


Seale Integration 


To minimize the pinout (the number of 
input/output pins needed) several techniques can 
be used. First both the comparand and the mask 
values can be broadcast to the CAM in a bit serial 
protocol. This would mean that comparisons are 
done one bit at a time across all words in 
parallel. Bit serial operation would slow down 
the comparisons somewhat, but only on the order of 
the number of bits in a word. (a) 


To minimize pinout further, the data in, data 
out, and address lines of the circuit can be 
multiplexed onto the same pins of the package. 
This technique has been used _ successfully for 
other types of VLSI circuits, for example, the 
Zilog Z8000 microprocessor. Minimizing the number 
of pins (and output drivers) would significantly 
reduce the cost of the circuit and increase the 
area available for storage. 


The cost of CAM has been estimated to be 1.5 
to 3 times the cost of an equivalent size RAM [6]. 
Memory sizes up to 64k of 32 bit words per circuit 
Printed circuit cards 
containing 4k bytes of CAM have been on the market 
since 1978 [8]. | 


Finally, CAM architectures lend themselves to 
a solution of the yield problem for VLSI. The 
problem is that a single flaw in one place of a 
VLSI eireuit will cause the whole circuit to be 


unusable. As the physical area of VLSI circuits 
increases, so does the the probability of a flaw 
ruining a given circuit [10]. Sinee CAM 
operations, unlike RAM operations, do not depend 


upon where in memory a particular value is stored, 
it would be possible to disable flawed words of a 
CAM circuit, after testing, and still use _ the 
resulting (smaller) memory. | | 


(a) The time per bit would be on the order of 10 
nano-seconds. Therefore, even with bit serial 
operation, with reasonable word lengths, the time 
for a CAM operation would be on the order of the 
time for a inachire instructior. 


The Ideal CAM for 


A I ee <eene 


LISP 


For most applications CAM words are quite 


long. The Semionics CAM, for example, has 256 
bytes (2048 bits) per word [8]. This allows 
entire records of data to fit in one word. A 
record might contain an employee's name, address, 
telephone number, pay rate, regular hours, 
overtime hours, etc. This would allow searching 
on any field of the record to retrieve it. 
Although there are standard techniques for 
spreading records across two or more CAM words, 


this slows the search considerably [6]. 


An ideal CAM for LISP has much shorter’ words 
Since it is desirable to have only one LISP cell 
per CAM word. We discuss several types of LISP 
cells below. Here we concentrate our discussion 
on list cells which have seven fields: Flags, 
Garbage, Cell type, Left, Left type, Right and 
Right type. 


CAM 
and 


The Flags field is used for complex 
searches involving logical disjunction 
eonjunction of different match criteria [6]. The 
bits in the Flags field are used as "temporary 
storage" for the responder bit of each word. The 
Flags field could be replaced by several auxiliary 
responder bits for each word and CAM operations to 
logically combine them [6] [8]. 


The Garbage field need be only one bit, 
indicating if the cell were "free". Using this 
bit we completely dispense with the Free list 
found in most LISP implementations. 


The Cell type field indicates if the cell is 
a list cell, a string cell, or any one of a number 
of other types. We discuss this in detail later. 
The Cell type will facilitate any desired strength 
of typing and also allow cells of different types 


to share the same memory’ space (without 
partitioning) and the same garbage collecting 
scheme. 

The Left_type and Right type fields will also’ 
enforce typing. They allow us to_ pack short 
integers, bit strings, and pointers to machine 
language code into the cell. In addition they 


Simplify the garbage collect process by 
us to test whether a 
pointer. 


allowing 
given Left or Right is a 


The Left and Right fields would, as usual, be 
large enough to point to any other cell in memory. 
That is, a memory with 2**n CAM words (cells) 
would require n—bit Left and Right fields. 


The CAM operations that need to be supported 
are SET-ALL,: MATCH, SELECT-FIRST, SOME-NONE, 
~READ-—RESPONDERS, WRITE-RESPONDERS, READ, and WRITE 
as outlined above. The COUNT-RESPONDERS is not 
necessary. Additionally, for the name-value 
binding scheme outlined below, a FIND-GREATEST 
function would be helpful. 
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Real-Time List Processing with CAM 


RE OTLEY | ERODE 


The Algorithm on a Simplified LISP CAM 


ETRE NT a AT NNN, NTE 


We begin the description of our. algorithm 
using a CAM in which each word contains one 
Simplified LISP cell with only three fields: Left 
(CAR) and Right (CDR), which both point to another 
LISP cell, and a Garbage bit (see figure 2). 


» bHRBAGE 
, 


PL LEFT 


Figure 2. Simplified CAM LISP Cell 


collection with 
if there are any 
CAM operations: 
and a CAM search 
in memory. 


The key observation about garbage 
such a cell is that we can find 
pointers to a given cell with two 
a CAM search of the Left fields 
of the Right fields, of all cells 


Any practical implementation would use CAM 
words to hold _ several different kinds of cells. 
In particular, our implementation uses’ special 


cell types to allow garbage collection of strings, 


name-value bindings, and the primitives of the 
GRASPER graph processing language [9]. We discuss 
how these special cells are handled. after 


presenting the simplified one cell type algorithm. 


When a free cell is needed, a CAM search 
done for a cell whose Garbage bit is set. 
done by the Supply free cell routine in 
(which appears 
these cells is 
operation. This 
the needed free cell. 
however, to propagate 
Sub-structures of this cell. This is done by the 
Potentially make garbage routine in figure 3. We 
do this by first CAM searching the Left and Right 
fields of all other cells for equality to C.Left. 
If there are no responders to this search 
(SOME-NONE has value NONE), then the cell pointed 
to by C.Left is garbage and we set its garbage 
bit. If C.Left = NIL, then the search need not be 
done. We handle C.Right in an identical way. The 
algorithm requires that all cells be initialized 
with their Garbage bits set and their Left and 
Right fields set to NIL. | 


is 

This is 
figure 3 
at the end of the paper). One of 
selected with the SELECT-FIRST 
It is still necessary, 
"garbageness" to the 


A piece of list structure potentially becomes 
garbage when one of possibly many pointers to it 
is deleted. This can occur in several ways during 
the execution of a program. The functions REPLACA 
and REPLACD explicitly delete pointers from the 
left ¢€CAR) and right (CDR) fields of list cells. 
The function SET (assignment) also deletes the 
pointer to a variable's old value. These 
functions all call the routine 
Potentially make garbage on the the pointer they 
are deleting. This routine determines whether’ to 
set the Garbage bit of the head cell of the 


cell, call it C, is returned as ° 


- 


structure pointed to. All sub-structure will be 
handled if that head cell is made garbage and when 
it is actually reused. 


Circular lists cannot be garbage collected in 
our regular scheme because there is always a 
pointer to any cell in the circle. They can be 
accommodated, however, either by requiring the 
user to release them explicitly, or by simulating 
them with a "lazy evaluation" scheme (see Allen 
[1] for details on lazy evaluation). 


Extensions For Other Cell Types 

Our scheme is easily adapted to other’ kinds 
of dynamic data structures. Here we will discuss 
an implementation for strings. Remember that, as 
discussed earlier, our LISP list cell actually has 
seven fields. The simplified cell is augmented 
with a Type field for the cell and for the Left 
and Right fields. These fields are necessary for 
the algorithm, but also allow us to enforce 
typing. Typically, typing is done by putting all 
of one kind of data together so that address alone 
can be used to determine type. In our scheme, if 


a field is of type T, it may only point to a cell. 


of type T. 


Strings are made up of linked lists of cells 
(see figure 4). String cells, like any other cell 
type, must be fit into the existing size CAM word 
and must have Type, Garbage, and Flags fields. 


They also have several bytes of character data and 


also Next, a Cell ptr implicitly of type string. 

The implicit typing | saves space in the cell and it 
does not cause a problem, since string cells can 
point only to string cells. 


GARBAGE 
CELL type VeLass 


1. [awne]] —[RTETATUT=157 


$2 


RI 


trgHT RIGHT 


CELL TYPE (FLAGS LEFT LEFT 


GARBAGE 
The string "REAL-TIME" ($1) and a List cell. (L1) 
whose CAR points to a string beginning "SYSTE". 


Figure 4. Example of CAM LISP Strings 


sotnap acnpmmesetoneed ania omen 


Unlike a list, when the head of a string 
becomes garbage, the entire string is known to be 
garbage. Potential "garbageness" need only be 
propagated down the Next field link and the 


Other _ptrs to operation need not be done, 


For example, in figure 4, assume that cell L1 
is made garbage, When the cell is chosen to be 
reused, we attempt to propagate "garbageness" to 
L1. Left, If there are no other pointers to cell 


fields, atoms have a Value 


53 the string "SYSTE..." becomes garbage. 53 is 
marked garbage and when it is reused no other CAM 
searches need be done. 


Atoms are also implemented as Special cells. 
In addition to the Flag, Cell type, and Garbage 
field and Value type 
field, pointing to the atom's static binding, and 
a Print_ name field implicitly of type string (that 
is, pointing to a cell of type string). 


A Truly Associative "A-List" 


In LISP each function call creates a set of 
name-value bindings which exist during the 
execution of the function and disappear at its 
completion. This is roughly equivalent to the 
formal to actual parameter bindings in other 
programming languages. Traditional binding 
schemes use one or more lists to associate names 
with values, A list used this way is called an 
A-List for Association-List (see Allen [1] for 


more details). 


In our scheme the A-list, like the Free list, 
does not exist. Instead the bindings are held in 
a set of distinguished cells, existing anywhere in 
CAM. When entering a new environment, we 
increment an environment counter and create a_ set 
of CAM cells to hold the names bound in that 
environment, their values, and the new environment 
number. Now we can ask the question above as a 
Single compound CAM search for a name-value 
binding within an environment, and retrieve the 
current binding directly. Since the current value 
of a name might not be in the most current 
environment, we need to search for the greatest 
environment number for that name. 


When an environment is exited, a pair of CAM 
Operations is executed. First a search for all 
environment cells with the current environment 
number, followed by a WRITE-RESPONDERS operation 
to make all these cells garbage. Since no other. 
cell will point to these binding cells, even if 
some do point to their descendants, they can all 
be turned into garbage in one operation. 

Figure 5 summarizes all the cell types 
discussed in this section. 


Garbage, Flags, and Cell_type fields occur in each cell. 


List Left, Left type, Right, Right type 

Atom Print name (implicitly of type string) 
Value, Value 1 type 

String Character 1, »Character rn, Next (implicitly 


of type string) 
Environment_number, Name (depiseityy of 
type atom), Value, Value )_ type 


Environment 


Figure 5. Summary of Cell Types 


Other Issues 


CDR-Coding 


Many recent LISP. implementations use 
CDR-coding , compact encodings of list 
representations which take advantage of 
statistical regularity in list structures (see 


Bobrow and Clark [3] for a summary and discussion 


of these schemes). A CAM augmented LISP with 
CDR-coded cells is easy to imagine, though it 
would require considerable extra time and 


complexity in the implementation of the basic LISP 
operations. Finding all pointers to a given cell 
would, in general, require a CAM search for each 
possible interpretation of a cell pointer field. 


| Given decreasing hardware costs, we did not 
feel it necessary to compromise the simplicity and 
speed of our algorithms. In particular, 
CDR-coding offers no solutions to our primary goal 
of real-time operation since it reduces space 
rather than time needs. 


Virtual Memory 


Our scheme does not support virtual memory. 
In general, it would be impossible to perform the 
test Other _ptrs to on a given cell without paging 
every active page of the virtual memory into CAM, 


The application programs we envision for our 
system can always be tested in advance to 
determine their space needs. More CAM cells can 


always be added without a time penalty. 


Our Implementation 


We are currently implementing the LISP system. 


discussed above 
and 80K bytes 


using a Z80-based microcomputer 
of CAM. The CAM, Semionics 
Recognition Memory (REM) [8], is organized as 320 
256-byte words (called "super words" in the 
company literature), We do not need such long 
words and have cut the memory into vertical 
Slices, yielding 32 LISP cells per word. Although 
this means that many of our CAM operations will 
have to be repeated 32 times in the worst case 
(once for each vertical slice), the system runs at 
an acceptable speed. The real-time properties of 
our system remain intact. 


The project is a pilot study to examine two 
issues. First we wish to show that even with 
relatively slow CAM (bit serial searches on the 
order of 1 micro-second per bit) which is not 
organized to our needs, we can build a real—time, 
self-contained LISP system. 


Second, the graph processing language GRASPER 
useS many associative operations which can be 
Supported by CAM. (b) GRASPER objects have the 


to 


design 


dynamic allocation needs as other LISP 
objects. We will embed a subset of the GRASPER- 
language into our LISP system using the cell 
typing conventions already discussed. We expect 
show the advantages of a CAM based GRASPER 
System as part of a feasibility study for the 
and implementation of a state of the art 
CAM on our VAX 11/780. 


same 


Conelusions 


We have presented a scheme for implementing a 
real—time LISP system by using Content Addressable 
Memories for storage of the basic LISP cells. Not 
only does our scheme perform all elementary 
operations in real-time, it also has the following 
other advantages: 

All cells are available for use, in 
contrast to other real-time schemes. 


1. 


Retrieving the correct value for a name 
can be be done truly associatively, 
always requiring only two CAM operations. 


Ce 


Strings and other dynamic data types. can 
be elegantly and efficiently integrated 
into the basic scheme wi thout 
partitioning memory. 


4, CAM is eminently suited to modern VLSI 
implementation techniques. 

Our scheme does have limitations, however: 

1. Circular lists cannot easily be garbage 
collected. 

2. Our scheme does not lend itself to a 
virtual memory environment. 

We believe that even given the above 

limitations, our scheme is_ an attractive 


alternative for self-contained, dedicated systems. 
It is usable in a real-time environment and all 
basic LISP operations perform extremely quickly. 


We believe that tested AI systems written in LISP 


resentative aa sehen anthesis 


(b) GRASPER is used to represent and operate on 
semantic nets, augmented transition networks 
‘(CATNs) , HEARSAY-II style. blackboards, and other 
associative data structures used by AI projects at 
the University of Massachusetts. 
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could be transferred to a CAM—augmented LISP 
machine without costly redesign and without 
recoding in a standard systems programming 
language (e.g. assembly language or Ada). In 
this way we hope our’ scheme will aid in the 
creation of simpler yet more powerful 


computer-controlled systems. 
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Do eee eee Sate a. anni epee + eee 


Collection 


function Supply free cell : Cell ptr; 
(* called by CONS to find a cell it can use 
to build a list structure with. In addition 
this function does the incremental garbage collect *) 
var Free cell, Temp : Cell ptr; | 
begin 
search for first Free cell from Cell 
wnere re CellfFree_ cell].Garbage 
do begin 
if if Cell{Free cell].Left <> Nil_ptr | 
then begin 
Temp := Cell{Free cell].Left; 
Cell{Free cell].Left := Nil_ ptr; 

(* These two make sure a check for other 
pointers = Cell{Free cell].Left will 
not respond to that field itself *) 

Potentially make garbage (Temp) 

(* propagate "garbageness" *) 

end; 


if Cell{Free cell].Right <> Nil ptr 
then begin 
Temp := Cell{Free cell].Right; 
Call{Free_ cell].Right := Nil _ptr; 

(* These two make sure a check for other 
pointers = Cell{Free cell].Right will 
not respond to that field itself *) 

Potentially make garbage (Temp) 
(* propagate "garbageness" *) 


end; 
return Free cell 


end 


else System_error ("Cell space full") 
end; 


procedure Potentially make garbage (C : Cell_ptr); 


Cell(C].Garbage := not Other _ptrs to (C) 
end; ) 


function Other_ptrs to (C : Cell ptr) : doolean; 
var Responder : Cellptr; 


begin 
Search for Responder from Cell 
where not Cell{Responder].Garbage 


and CelllResponder].Left = C 
do return true 
else search for Responder from Cell 
where not Cell{Responder].Garbage 
and Cell[ Responder]. Right = C 
do return true 


acca d 


else return false 


end; 


(continues on the next page.,.) 
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(...figure 3 continued) 


procedure Init_CAM; 


var Responder : Cell ptr; 
begin ~ 
search for Responder from Cell 
where true 
do begin 
Cell{Responder].Garbage := true: 
Cell{Responder].Left := Nil ptr; 
Cell{Responder].Right := Nil-ptr 
end 


end; 


Notational Conventions 
The CAM is seen as an "associative" array of records, where 


each record represents the data in one CAM cell. Standard > 
indexing into the array allows us to treat the CAM as_ RAM. 


From the above we have two data types: 
Cell _ ptr = 1..Num cells; 


Cell = associative array [Cell ptr] 
of record 
Garbage : boolean; 
Left, Right : Cell ptr 
end 


The basic CAM operation is: 


search for [ first ] <index variable into CAM> 
from <CAM array name> 
where <boolean expression> 
do <statements> 
else <statements> 


The <index variable> is available within the do <statements> 
to syntactically represent all cells that neet the search 
criteria. This <index variable> is a free variable ranging 
over all possible values, that is, indexing all cells in the 
CAM array. For each CAM cell where the <boolean expression> 
is satisfied, the do <statements> are executed. The do 
<statenents> are performed in parallel for these cells. In 
the case of "search for first", the index variable gets set 
to the value of the first responder. In the case that no 


cells satisfy the <bool ean expression, the else 


<statenents> are executed. Typical CAMS do not support the 
generality implied by this construct. In particular, 
arbitrarily complex <boolean expressions> will take N CAM 
searches, where N is the number of disjuncts in a 
disjunctive-nornal—form version of the <boolean expression, 
and do <statements> are limited to assignments to the cells 
indexed by the <index variable>. Other operations can be 
supported either by more intelligent CAM cells or bya 
micro-coded CAM controller. Our algorithns use the 


construct in ways easily implemented in CAM. 
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ABSTRACT 


MAP is an associative multiprocessor of medium 
size. It has been designed for experimentation in 
pattern recognition area - especially speech reco- 
gnition. The machine is composed of sixteen micro- 
programmable processors. At the microprogram level, 
every processor is autonomous and can perform its 
task without receiving any external command. At 
the collective level, control is assumed by an 
extra master processor. This processor is concer- 
ned with Input-Output and common orders distribu- 
tion. The architecture presents special accomoda- 


tions for synchronization between processors. Some © 


of them are driven by an. associative arrangement. 
The total instruction rate is 68 MIPS, allowing a 
real time processing of the speech. 


INTRODUCTION 


The architecture of a multiprocessor machine 
must optimize, both data and instruction flows. 
Often, these two goals appear to contradict each 
other. However some facilities may occur when the- 
se flows are driven with a good regularity or re- 
petition of simple pattern. Particularly, when a 
unique model of control distribution and data ma- 
nipulation may be taken as a general representa- 
tion of the behaviour of the processors, the ar- 
chitecture may be designed according to it (for 
example - vector computing with SIMD architecture). 


In the most general case it is quite impossi- 
ble to find out such a model and, accordingly, to 
obtain a satisfying balance between two conse 
traints : 


- availability of a flexible control scheme 
for parallelism able to support distributed algo- 
rithms 


- realization of a fast and simple communica- 
tion tool between processors. 


The first goal implies the design of indepen- 
dant and autonomous processors but, conversely, 
represents a difficulty for getting an easy solu- 
tion for the second one. Communications have to be 
localy controled by each processor according to a 
communication protocol. Then, data transfers, are 
complicated and slow. . 


So, when studying a special purpose architec- 
ture the first step is to point out the regular 
properties of the application involving facilities 
in control distribution and data. 


0190-3918/81/0000/0120$00.75 © 1981 IEEE 
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Pattern recognition applied to voice analysis 
has two typical characteristics 


- Input data flow is strictly sequential and 
periodic 


- The amount of data to be held at a time is 
not very large and may be easily ordered. 


THE SPEECH RECOGNITION CONTEXT 


The most usual way to drive a speech recogni- 
tion process is to use a mathematical representa- 
tion derivated from.signal processing models [11]. 


The aim of the project is to use an associa- 
tive model related with a data base organization 


C4]. 


Speech processing may use as an input unit, a 
channel analyser. It is composed of sixteen input 
filters distributed along the voice spectrum. At 
every sampling period, a filter issues a digital 
value in proportion to the quantity of energy re- 
ceived in the channel. 


According to the noise and the limits of pre- 
cision, one value may be represented by a binary 
positive number of height bits. Then a sample is 
a.16 bytes vector or a 128 bits word. Period may 
be taken between 10 and 50 ms. 


The input data flow may be looked as a two 
dimensions array in a timefrequency diagram (fig. 


1). £9][10]. 


At the phonetic level, the element to be 
identified is named "phonem" and represents a ty- 
pical sound produced by the speaker. [12][13]. 


A phonem stretches itself in the two direc- 
tions time and frequency -as a fuzzy pattern- it 
seems to be possible to recognize such a pattern 
by comparison with models which have been stored 
in an associative memory [2][8]. Unfortunatly the - 
direct comparison is impossible and it is neces- 
sary to extract from the input flow some charac- 
teristic informations such as : 


- mean value 
- peak location 
- ratios in upper and lower frequencies 
- measurement of relief 
etc ... 


Theseinformations come out from an horizontal 
(time) or vertical (frequency) or mixed analysis 


L384. 


Using these informations, a process must fol- 
low various tracks among the stored patterns used 
as references. It has to compute a dynamic score 
for each:of them and to decide : 


- rejection of a bad candidate 

- acceptance of one or several good 
candidates .(a choice will be done 
at the upper lever referring to syn- 
tax or semantics) | 


- pursuing the operation with the following 
samples 
- activate new candidates. 


Although MAP is designed for experimentation rather 
than for exploitation, the previous considerations 
seem to be general and lead the organization of 
storage. The informations used as references are 
sets of samples. These sets are organized in files 
Some files are time indexed and represent phonems. 
Some files are type indexed and gather all the 
samples which have similar properties or measure- 
ments (fig. 2). 


A veference sample located in the date memo- 
ry may belong to several files and the associative 
process will have to follow various links before 
idntifying a phonem. Consequently this memory must 
present the following characteristics : 


- basic items.are samples (128 bits) 

- there is a need for a fast (parallel) 
access to one sample. 

- facilities must be provided for multifiles 
description (linkage). 


It is obvious that a multiprocessor is adequate for 
such a processing [5], [6] ,[7] including 


- parallel computing in order to extract 
significant informations from unknown sam- 
ples. 

- parallel access to models of patterns, 

- comparisons between vectors and measure- 
ment of distances. 


GENERAL DESCRIPTION OF MAP 
The processing model derived from this appli- 
cation may be described by the flow of fig. 3. 


MAP has been designed from this model with 
two control levels. It is composed of sixteen 8 
bit microprocessors. The Low Level Control (LLC) 
is local to the PE. and brings facilities for auto- 
nomous processing. The High Level Control (HLC) is 
unique and has to drive, organize and synchronize 
collective activities. 


At the High Level a single control unit 
issues general commands that are identified at the 
same time by the PEs. A general command is initia- 
ted when the former one has been achieved by all 
the processors. From this point of view, the machi- 
ne seems to have an SIMD architecture with a se- 
quential running of the program. 


At the Low Level, a specific program is loca- 
ted in the control memory of each processor. So 
every processor is able to perform its own and par- 
ticular part of the task. A processor must take 
into account 


- its own location 

- its status (resulting from previous opera- 
tions) 

- informations produced by neighbours. 
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Two buses allow communications between HLC 
and LLC. The command bus is provided for distri- 
buting general commands or common data in paral- 
lel. The control bus is organized in a polling- 
selecting manner and driven by the HLC processor. 
By this means the HLC processor may observe clo- 
sely the activity of the PEs and pick out final 
results. 


Every processor may access two routing re- 
gisters. The first one -128 bits- may be shifted 
to the right along all the sixteen PEs. The other 
one -136 bits- may be shifted to the left via the 
HLC processor. Routing operations are controled 
by the processors themselves. Two neighbour pro- 
cessors or a consecutive set of processors may 
request a partial use of these buses for local 
communications. 


The storage is divided in two parts : one 
processor possesses its own control memory. Accor- 
ding to the characteristics of the chip - 8 x 300- 
this control memory is a 4K - i6bits RAM- . During 
processing, this memory cannot be altered and is 
used as a ROM. 


Data memory is organized in a 128 bits wide 
- 16 K words store. Every slice of 8 bits is de- 
dicated to one processor. There are 9 adresses 
producers : every pair of processors PLUS the HLC 
processor may access the data memory through a 
priority encoder. This unit is provided for con- 
flicts management but, most of the time, these 
conflicts may be avoided by synchronization at the 
LLC. 


They are mainly two types of informations to 
be stored in the data memory : 


- voice samples represented by sixteen orde- 
red bytes 

- linking informations, that is to say, 
addresses represented by 8 double bytes. 
One word of memory contains 8 links, thus 
a sample described by this word may belong 
to 8 different files. 


As these linking informations are used to cons- 
truct complex data structures between reference 
samples, the data memory is seen as a special pur- 
pose, read only, data base with a fastened access 
and a limited capacity. 


THE PROCESSING ELEMENT 


In spite of an appearance of choice, they 
were not a great amount of possibilities for the 
microprocessor of a processing element 


- a custom designed processor was rejected 
because of the delays 

- rapidity is a major argument 

- ability for microprogramming is important 

- data manipulations are considered to be more 
interesting than computing possibilities 


The typical architecture of 8 <x 300 from Signetics 


seemed to present the best characteristics for 
these criterions. [1] 


A processing element is composed with : 
~ CPU : 8 X 300 - 250 ns for one instruc- 
tion 


~- three registers for memory control 

- four-registers for routing 

- two registers for exchanges with the con- 
trol and command buses 

~ two registers for sorting 

- two registers for synchronization 

- four 16 K bits static RAM, aEPavSeS in 
4 K - 16 bits store. 


All the program is loaded into the RAM before 
starting, though a special loading bus. This pro- 
gram is composed of : 


- a general command analyser 

- various sequences corresponding to the 
commands. The maximum number of sequences 
is 255. They are initiated by HLC 

- synchronization and communication proce- 
dures. 


During a sequence a processor is able to 
access the data memory, to exchange informations 
with its neighbours, to receive and send informa- 
tions from or to the HLC processor, to present and 
accept synchronization demands and, of course,to 
perform local computations. A macro-assembler bring 
facilities for writing the sequences in parallel. 


Pig. 4 show the architecture of a PE and fig. | 
5 is a simplified representation of the program 
organization. 


In order to increase the performances.of col- 
lective operation. a wired sorting unit has been 
added to the processors. This unit gives at any 
time the maximal value among those presented in 
parallel by all the processors. This SORTER is a 
tree and returns to all the processors the number 
of the winner. 


It takes three instructions (750 ns) before 
getting the results of a sorting operation : 


MORE VALUE TO RSORT 
COMPARE WINNER'S CODE TO LOCAL CODE 
JUMP IF NOT EQUAL 


Many general purpose sequences have already 
been written, let us give some examples (I is one 
instruction or 250 ns) 


- compute the mean value rounded in 
one byte (12 I) 

- compute the location of gravity center 
(16 I) 

~ compute the moment of. inertia with regard 
to a processor (28 I) 

~ find the best ressemblance between a 
given sample and a file of 2 references 
with distance; 
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16 
d= b Ix; - 7, > (5 + 10 £).1 


~ find the best ressemblance with distance 


16 
} Vic? z ar > (5 +18 £).1 


d= 


The general control processor is also a 8 x 300 
module. Over and above the communications with the 
PE, it has a private memory used as a general 
control store. This memory is shared in a multi- 
access arrangement with a conventional processor. 
Because of the low rate of the inputs and outputs, 
a microprocessor is sufficient, then the HLC pro- 
cessor has only to search and distribute general 
commands. 


This host processor also has two extra roles 


- load the PE, programs before executing 
a program 

- compile new programs to be loaded from 
macro-assembler to 8 x ad machine lan- 
guage. 


Pig. 6 shows the architecture of the whole system. 


THE SYNCHRONIZATION UNIT. 


Because it is the most important part of the 
distributed control, the synchronization unit will 
be described in detail. 


There are two occasions where processors must 
execute in a synchronous way : 


- at the end of a LLC sequence in order to 
obtain a new general command from the HLC 
processor | | 

- before communications, sorting operations 
or memory accesses. 


All the processors must have exactly the same 
behaviour during the operations because all of them 
are working at the same level. For this matter the 
synchronization is designed according to an asso- 
clative model. 


The first family of synchronization tools is 
applied to well delimited sets of processors. For 
one given set each processor K has two flags 


- a D, Flag used as an output device 
(detland of synchronization in the set) 

- a C, Flag used as an input device 
(command of synchronization for the who- 
le set). 


The logical relationship between these flags | 
is easily realized with a unique AND circuit : 


C D 


K = 1 n D D 


Nn 


att n Dy fi eiass 


Synchronization occurs when each of the PE. of the 
Set execute the same sequence of instructions : 


K = ijn 


SET D,. = TRUE ; 


WAIT ; WHILE Ce. = FALSE GOTO WAIT 


NEXT : 


As a common clock drives all of them, the proces- 
sors are going to execute the NEXT labeled ins 
truction at the same time. This operation is pos- 
Sible because the 8 X 300 processor is able to 
perform in one instruction the test of C. and the 
corresponding jump. 


Tools have been wired for the following sets: 


Sets of two PE, : (POP, )(PoP,)..- (Py Pi.) : Flag D, 

(P iP )(PoP). +. (Py PQ) Flag D, 
Set of four (POPsPoP4)--(Py5Py5P4,P1,5) Flag D, 
Set of sixteen (POP, ssa dederd : (P5,Ps5) Flag Di, 
Set of seventeen (P P_.......... Pai sPus. plus “the 
HLC processor ) Figsps ay 


this latter set is particularly used before get- 
ting a new general command. 


Another manner to obtain synchronization bet- 
ween groups of processors consists of a dynamic 
construction of the group. The interest of such 
a tool is to allow synchronization by observing 
the results of processing rather than the location 
of processors. This is necessary within an asso- 
clative process when processors may issue some 
specific result, the value of which is significant 
for driving cooperation between them. Namely a 
subset of the network may request a synchroniza- 
tion because every processor of that subset holds 
a typical result while all the others do not. 


For that purpose, a processor may display one 
of the heigt names (0 to 7). 


Names O and 1 have special meaning 
O : the choice of a synchronization name has 
not yet been done 
1 : no synchronization required 
2,7 : effective synchronization names. 


The management of these names is realized accor- 
ding to the following rules 


(a) if there exist, at least, one synchroniza- 
tion name equal to O no synchronization is 
possible 


(b) in order to allow other groups to synchro- 
nize, a processor must display its choice 
(4 0) as soon as it is in position to do 


(c) displaying a name is not realy a request for 
synchronization ; the request is represented 
by an extra D flag. (D6) 


(d) among the processors that have displayed the 
same name, and AND circuit is dynamicaly 
provided and delivers the C command when all 
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the D flags have been switched on.. 


Realization is quite simple and entirely sta- 
tic. First a gate is provided in order to take in 
charge the (a) rule. 


As there are six names, each of them is con- 
troled by one AND gate. Every gate is controled 
by all processors through a network driven by na- 
mes. This network must decide whether one name, for 
one processor, is active or not. 

Fig. 7 presents an illustration of some usual 
cases of synchronization. 


CONCLUSION 


It is easier to design a special purpose pro- 
cessor than a general purpose one. The behaviour 
of programs is more closely identified and a spe- 
cific model of instructions and data flow may be 
established. Accordingly, the architecture is more 
sophisticated and the performance increased. 


In the MAP project these considerations gave 
the possibility to take advantage of two points 


- the main data structure is a fixed vector, 
- the control may be separated in a high 
(general) and a low (local) level. 


The former point imposed to realize a very 
flexible synchronization system between proces- 
sors. Such a system brings a great facility for 
writing parallel programs. 


This study was supported by CNET, the French 
administration for Research in Telephone and Tele- 
communication area. The machine in now under test 
and must be operationnal in a few weeks. 
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EXAMPLE OF THE SYNCHRONIZATION PROCESS. 


AIRBORNE ASSOCIATIVE PROCESSOR (ASPRO) 


Jon M. Surprise 
Program Manager, Digital Technology Department 
Goodyear Aerospace Corporation 
1210 Massillon Road 


Akron, Ohio 


Introduction 


Under company sponsored Research and Develop- 
ment programs and subsequently under Navy Contracts 
00019-78-C-0598 and 00019-79-C-0563, Goodyear 
Aerospace Corporation performed extensive tradeoff 
studies based on experience with STARANIM, to 
demonstrate the advantages of associative pro- 
cessing for airborne surveillance. Two advantages 
are: the simplicity of the software for managing 
the surveillance data base, and the high inherent 
processing speed of ASPRO. The necessary small 
size (0.35 ft3) and low power (330W.) are realized 
using custom CMOS VLSI and multichip CMOS random 
access memory. The ASPRO processor, now in final 
development, will augment the existing data pro- 
cessor aboard the Navy's Grumman E-2C aircraft. 
Its combination of content-addressability, multi- 
dimensional access (MDA) memory, and parallel 
processing provide a powerful architecture for 
real-time processing applications. 


Architecture 


The basic architecture of ASPRO is shown in 
Figure 1. 


DATA | CONTROL DATA 


MEMORY MEMORY 


BUS B 


PROGRAM 
EXECUTION 
CONTROL 


ARRAY 
CONTROL 


CONTROL 


ARRAY UNIT 


Block Diagram of ASPRO 


Figure 1. 


ASPRO is divided into five functional sub- 
systems: 


Control Memory. This subsystem is made up of three 
types of storage: (1) buffer memory, (2) program 
memory, and (3) read-only memory (ROM). 


The buffer memory provides storage for input 
and output data for the ASPRO. It consists of two 
identical modules, each capable of storing 8192 
words of 32 bits each. Each buffer memory module 
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has three access ports. Two of the access ports 
are connected to the two memory buses of an 
external computer. The third port is connected to 
the ASPRO's internal bus system. 


The program memory, which is loaded through 
the buffer, provides storage for the ASPRO machine 
instructions. Its one access port is connected to 
the ASPRO bus system. 


The ROM provides nonvolatile program storage 
for certain essential operations including pro- 
gram load and basic built-in test routines. 


Program Execution Control. This unit controls 


execution of instructions stored in program memory. 
Four index registers are provided, as is a sub- 
routine stack capable of accommodating 15 levels 
of subroutines. Conditional branches to any 
location in program memory can be executed. To 
maximize performance, fetch of the next instruction 
is initiated at the earliest possible stage of the 
current instruction such that it overlaps the 
current instruction execution. 

Register and Arithmetic. This section contains © 
the working registers, sequential arithmetic unit, 
and buses required for data transfer and control 
exclusive of the array. The logic in this unit 
consists of twenty-four 16-bit registers, two 32- 
bit registers, and a 16-bit arithmetic logic unit 
(ALU) interconnected by a bus system. The 32-bit 
memory bus is connected to one port of the common 
register and to the 32-bit instruction register. 


Data to and from the array unit flows through 
the register and arithmetic logic. The 32-bit 
bidirectional array data bus is split into an 
array input and an array output bus by the inter- 
face logic. All data transferred to and from the 
array are buffered by the common register. 


Sixteen general 16-bit registers and eight 
specific 16-bit registers are accessed via a 
16-bit data bus. The general registers are loaded 
from the ALU output and can be used for either ALU 
input argument. The specific registers are 
dedicated to array operations to hold array 
addresses and loop counts. 


The arithmetic logic unit (ALU) permits con- 
ventional arithmetic and logic operations to be 
performed upon data presented to the working 
registers. It can perform seven arithmetic and 
nine logic operations on two 16-bit operands. 
Multiplexers at the ALU inputs provide the 
capability to select various pairs of source 
operands. 


Array Control. This unit provides the timing 

and control to execute the specified array 
operation. Basic array operations include Read 
Array, Write Array Masked/Unmasked from Common or 
Array Register, and Output to Common Register. 
When reading from the array, array control sets 


up control lines to perform one of 16 possible 
Boolean operations between array data and the 
Processing Element registers. Jogical sequences — 
of these operations permit a wide variety of 
associative functions to be performed on the array 
data. 


Array Unit. The multidimensional access (MDA) 
array unit consists of four basic components: 
array memory, flip (permutation) network, pro- 
cessing elements, and response-store resolver. 
is partitioned into 17 array modules. Sixteen 
modules of 128 words each make up the 2048-word 
array. Each word is 4096 bits in length. The 
seventeenth 128 word module is a spare which 
may be switched in if one of the basic modules 
fails. Each module comprises a 128-word by 
4096-bit array of solid-state MDA storage and 
128 processing elements (PE's). 


The 2048 words of 4096 bits each provide a 
total of 8 megabits of data storage. Format 
of the 4096 bits is under total software control. 
Operand lengths can be 1 to 256 bits. The MDA 
storage organization provides access in either 
the bit or word direction, a technique proven in 
the STARANIM associative processor from which 
ASPRO has evolved. The flip network and associa- 
ted address logic: permits MDA using conventional 
RAM. This, in conjunction with the response store 
and resolver permits parallel processing and 
content~addressability without sacrificing normal 
word-mode input-output. 


The array includes 2048 PE's. Each PE 
contains 3 single-bit registers, and can: buffer 
data from or to array memory, execute all logical 
operations on two single-bit operands, condition- 
ally inhibit a write instruction, and provide 
the response store function for search operations. 
Associative array input and output is 32 bits via 
the common data bus. The array is partitioned to 
provide for reduction in required volume and 
power by the efficient use of custom VLSI circuit- 
ry. CMOS/SOS technology has been used for the 
PE VLSI integrated circuit design because of its 
low power and high speed. 


It 


Software 


A significant amount of system software is 
being provided to allow users to develop applica- 
tion programs. Software tools include: assembler, 
linker, loader, librarian, subroutine library, a 
debug package and diagnostics. Most of the system 
software is written in a high order language for 
portability, enabling program development on a 
variety of general purpose computers. 


The assembler is a conventional two-pass 
assembler which supports structured modular pro- 
gramming. The mnemonics are separable into two 
sets. One set is for the sequential control 
portion of ASPRO and is much like the instruction 
set for conventional sequential computers. The 
second set is for the associative memory and 
consists of double and triple address arithmetic 
and logical operations. _ 
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The output of the assembler is an object 
module which can be combined with other object 
modules via the linker and librarian into a load 
module. The load module, when loaded into the 
ASPRO can be interactively debugged with the 
debug package. The debugger allows the user to 
stop the program at any program location, dump 
registers or memory contents, change those 
contents and then continue the program. In the 
trace mode, selected registers and memory can be 
dumped automatically after every instruction is © 
executed. These and other features of the ASPRO 
debugger provide the user with a powerful de- 
bugging tool. | 


Two types of diagnostics are being developed 
for ASPRO: an on-line self-test program which is 
executed periodically to assure operational in- 
tegrity and off-line diagnostics to isolate faults 
to a specific section of hardware. 


Performance 


The relative processing time for ASPRO in a 
radar tracking application is significantly less 
than a conventional processor when the number 
of tracks increases from several hundred to 
several thousand. 


Table I is a simplified comparison of pro- 
cessing time for some typical operations on a data 
base of 2000 items. 


Table I. 


Performance Comparison 


ASPRO CONVENTIONAL 
ASSOCIATIVE | COMPUTER 
PROCESSOR 


ISINCLE-BIT SEARCH | 0.5 uSEC 
16-BIT ARITHMETIC | 32 uSEC 
OPERATION 


OPERATION ON 
2000 ITEMS 


1000 wuSEC 
3000 USEC 


The ASPRO processor is a dense, low-power, 
high performance processor. This parallel pro- 
cessing system is designed to replace or augment 
existing, conventional airborne data processing 
systems. ASPRO's simple software and high-speed 
search and processing capabilities provide a 
unique, cost-effective solution to real-time 
signal processing. 
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Summary 


The scope of various applications of the 
associative or content-addressable processors 


(see, e.g.[1]) is extended to random walk modelling. 


The main problem in this modelling is to provide a 
random choice among a set of alternatives. Let 

us consider n alternatives with probabilities P.. 
The choice of an alternative with corresponding 
probability can be presented as a hit by a random 
number R in the range (0-1) of one of the intervals 


where 
g,= 0; 6) Pig Sy = Pp * Pass a By SD 


This procedure can be organized as a search among 
the numbers S., 51» Dig: Se a for that which 
is the Taeeeee smalle? than R. “Using an associa~ 
tive memory of ternary elements the intervals 

(S, , - S,) can be presented in such a way that 
this search will be performed with one memory call 
[2]. (The third state of the ternary associative 
element (-M) provides matching signals for both 
"Oo" and "1" interrogations, and it can be imple- 
mented either with special hardware or with soft- 
ware using two bit combinations in binary 
associative memory. ) 


We will illustrate this method by an example 
(Fig. 1). Suppose we have four alternatives with 
the probabilities 2/16, 5/16, 6/16 and 3/16. The 
total random number range 0000 < R < 1111 can be 
covered, for example, by the following ternary 
combinations: 


iy a “QOOM 2/16 


2 - OO1IM —- 5/16 
O1LOM 


O110 


OQlll - 
1OMM 
1100 


6/16 


11i0l - 
111M 


3/16 


Such a representation is a subject of minimi- 
gation (cf. Fig. l,a). The alternatives 1, 2, 3 
and 4 will be accessed with the probabilities 2/16, 
5/16, 6/16 and 3/16, respectively, because the 
chances for a random number R to match to one of 
these intervals is proportional to its length, 
i.e., 2, 5, 6 and 3. 


chain, p.., can be stored in a format: (i, V, 3), 
where i i8 a starting state, V - a ternary combi- 
nation corresponding to the choice of the j state. 
The interrogation of the associative memory by 

(i, R) will result in a random choice of "j" with 
the probability Pass i.e., "a transition isj". The 
process of Markov dhain modelling is a succession 
of such transitions. 


The associative pipelines as suggested in[3] 
have actually the same algorithmic capabilities as 
associative processors, but the pipelines are more 
efficient in implementation and suitable for pro- 
cessing of large volumes of information. The 
uniform cells of the associative pipeline realize 
in succession the transformations isomorphic to 
that realized by the associative processor in 
parallel. The random choices are made in the pipe 
line cells by picking-up the numbers of the alter- 
natives from the passing word-stream when the 
ternary combinations corresponding to their proba- 
bilities match the provided random numbers. All 
cells operate on the word-~stream concurrently with 
the shift in time according to the propagation 
delay. The associative pipeline can easily perform 
the first-match selection for the multiple respon- 
ses, so the necessary intervals for random choice 
can be constructed simpler using overlapping 
ternary combinations with partial screening of the 
successors. This is illustrated in Fig. l,b. 

There are two possibilities for selection alterna- 
tives #1, three ,possibilities for # 4, six possi- 
bilities for # 3, and five for # 2. It does not 
matter that the alternatives are not presented by 
contiguous segments; if R is uniformly distributed, 
the chances of the selection of the alternatives 
will correspond to their probabilities, i.e., 

#1 —- 2/16, # 2 - 5/16, # 3 - 6/16 and # 4 - 3/16. 
It should be emphasized that the choice of the 
alternative is performed in each cell independently 
and is determined by its own random number R only. 


The basic unit of the computer system for 
modelling Markov chains is presented in Fig. e. 


The output of the pipeline cell is connected 
to its command register to extract information 
from the word-stream. The possibility of the con- 
trol of the computing process through the word- 
stream is an attractive property of the associative 
pipelining, which can be efficiently used in 
different problems as, for example, considered in 
[4]. The word-stream is a mixture of transition 
matrix elements in one format and control computer 
messages in another format. The mode of operation 


PHS A panes i Oe eas OP nad Seer oie is specified by tag bits, which also serve as a lock 
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to permit some operations on a given word and to 
prevent further access to this word in other cells. 


of 


The basic operations are the following: 


Initialization - a random walking point should 
be set into a certain initial position. 


Transition - moving from a given state i to — 
one of the states j according to the 
probabilities (P55) 


Random number supply - after each transition 


the random number R must be changed to deter- 
mine the choice of the next alternative. 


sensing — the results of random walking should 
be returned to the word-stream and processed 
by the control computer; two types of Markov 
chains are usually considered: with and 
without absorbing states, the results of the 
modelling are some characteristics of the 
random walks to absorbing states in the first 
case, or of the equilibrium distribution in 
the second case. 


The suggested technique is a typical example 


the organization of the computing processes 


eo FEET elo 


jo fz ts Pe Ts Toh 


Ee Jo [= [=f ob 


a 


with associative pipelining. Not bound by storage 


limitations, it can be efficiently applied to the 
investigation of very large stochastic models in 
system analysis and computational physics. 
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Fig. 1 Random choice with the associative 
processor (a) and pipeline (pb). 
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ABSTRACT -- This paper considers reconfrguta- 
tion of dynamic architectures into multicomputer 
networks that can assume rings, trees, and stars 
configurations. 

Reconfiguration algorithms introduced are 
one-step algorithms performed concurrently by all 
network nodes requested for reconfiguration. The 
time of this step 4s the time to execute a one- 
bit shift and mod 2 addition. These recongigura- 
tions can be accomplished with special shrft- 
negisters called shift-registers urth vartable 
bias (SRVB) introduced into each network node, N, 
that stone the posction code of this node. Upon 
necetpt of the reconfiguration instruction, each 
Such negister generates the posrtion code of the 
network node, N*, with which node N must estab- 
kish a data path consistent with the overall net- 
work configuration (tree, star, or ring). 


1. INTRODUCTION 


As was shown in the literature [1-3], a 
dynamic architecture may increase a. system 
throughput using the following adaptations to 
algorithms: 

1) Adaptation of the resources to instruc- 
tion and data parallelism, and 

2) Reconfiguration of the resources into 
multicomputer, multiprocessor, array, and pipe- 
line architectures. 

A multicomputer adaptation to algorithms is 
generally understood as: 

a) The architectural capability to parti- 
tion resources. into a variable number of dynamic 
computers with changeable word sizes, and 

b) The capability of multicomputer archi- 
tecture to function as a multicomputer network 
characterized by different topological configura- 
tions among its computers. 

Since various techniques that implement the 
a) property of dynamic architectures were studied 
in [2-5], this paper concerns itself with recon- 
figuration of dynamic architecture into a multi- 
computer network. 


2. APPLICATION OF MULTICOMPUTER NETWORKS 


A multicomputer network is characterized by 
different network structures formed by network 
computers, otherwise called network nodes. As 
was shown in the literature [6-12], the most con- 
venient network structures are rings, trees, 
stars, binary cubes, closely connected graphs, 
and mixed structures that involve various combina- 
tions of the above mentioned structures. 

Rings, cubes, and strongly connected graphs 
are useful for computational algorithms in which 
each computer node performs computations only and 
assigns no computations to other nodes. 
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Trees and stars are useful for both computa- 
tional and control algorithms. Trees and stars 
are described by two types of nodes--Leaves and 
non-Leaves--where a leaf is generally understood 
as a node of the lowest level, i = 0, and a non- 
leaf node has level i > 0. A node(s) of the 
highest level, i = L, is called a ‘oof... 

A tree or a star may have one or several 
roots describing a corporate structure with one 
or several directors, respectively. The differ- 
ence between a star and a tree is in the number 
of nodes of a lower level that are adjacent with 
a node of level i. Ina tree, this number does 
not exceed one for a root, and two for each node 
that is neither root nor leaf. For a star, each 
non-leaf node of level i may have more than two 
adjacent nodes of a lower level. 

If a tree or star has one root it is called 
a one-noot tree or a one-rcot stan; if it has 
several roots, it is called a multiple root tree 
or a multiple root star. 


3. REQUIREMENTS FOR A MULTICOMPUTER NETWORK 


In order to provide a multicomputer network 
with very high flexibility and reduce the amount 
of data to be transferred among its nodes, any 
network must be provided with the following 
characteristics. 


Cl. Minimal Reconfiguration Time. This is 
understood as the minimal time required by the 
network to reconfigure itself into any of the net- 
work structures indicated above. 


C2. Multifunctional Node. This is under- 
stood as the capability of each node, N, to be 
connected into any network structure (ring, 
strongly connected graph, cube, tree, or star). 
Within a tree or a star a multifunctional node 
should be capable of functioning as a root (single 
or multiple), leaf, or a non-leaf node. As a 
result, a programmer will be able to minimize 
idle resources not involved in a particular compu- 
tation and to eliminate traffic bottlenecks 
created in particular portions of a network due 
to overcentralization of information flow from 
the root(s) to other nodes. 


C3. Variable Word Sizes of a Network Node. 
To increase the network flexibility each network 
node must be provided with the capability to 
change its word size. This will minimize the 
amount of resource interconnected into a particu- 
lar network configuration, and allows computation 
of additional programs using the same resources. 
The advantages of such computations are coinci- 
dent with those performed by dynamic architec- 


tures in general. 
in [2-5]. 

A multicomputer network that is provided 
with properties Cl, C2, and C3, and performs 
reconfigurations into the network structures 
described above, can be organized using the DC 
group described in [1-3]. 


These are treated extensively 


4, CONTRIBUTION TO THE STATE-OF-THE-ART 


This paper studies reconfiguration of dyna- 
mic architecture into rings, trees, and stars. 
Reconfiguration algorithms developed are based on 


the theory of shift-register sequences as follows. 


A network node N activates a data path with 
its immediate successor, N*, in the given network 
structure when N generates the position code of 
N* [2, 3]. This activation may be done with the 
use of shift-register and special constant B 
brought with the reconfiguration instruction to 
all network nodes that are requested for recon- 
figuration. 

4.1. Rule of Succession during Reconfiguration 
To activate each data exchange that can be 
either PE-ME*, PE-PE*, ME-PE*, or ME-ME*, it is 
sufficient for the processor element, PE, belong- 
ing to computer element, CE, identified with net- 
work node N to generate the position code of the 
CE* identified with the network node N* that 
contains a second element of the exchanging pair 
(ME* or PE*). This will be denoted as transition 
N + N* meaning that: a) N will generate position 
code of N*, b) N will establish a given data path 
between N and N*, and c) the data path activated 
by N can be made bidirectional--either from N to 
N* or from N* to N--and it can be one of the four 
types considered above (PE-ME*, 
ME-PE*). 

(Here for simplicity it is assumed that each 
node N 1s equivalent to one CE. An extension of 
the results accomplished to a dynamic computer, 
C(k), assembled of k CE can be done very easily 
by ea the same position code to all its 
CE" Ss 

To minimize the time of reconfiguration, it 
is reasonable to assume that for each network 
Structure, such rule of succession, N + N* should 
be maintained during reconfiguration for which 
each node N has a minimal number of successors N* 
in this structure. Then it will take the minimal 
reconfiguration time to establish all the data 
paths between N and each of its successors, N*. 

While for rings, the rule of minimal number 
of successors is trivial, for trees and stars it 
requires that the succession be maintained in the 
direction from leaves to root(s). Namely for 
each N + N*¥, the level of N is lower than N*. 

If all these paths are established concur- 
rently, the entire network reconfiguration takes 
time T of activating only one network transition, 
N + N*, and can be performed with a one-step re- 
configuration algorithm performed concurrently by 
alJ network nodes. 


4.2. Application of Shift-Register Theory 


In this paper, trees, stars, and rings will 


PE-PE*, ME-ME*, or 
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be generated with the use of shift-register 
theory. Its application proceeds along the fol- 
lowing lines. 

Assume that each network node N is proyided 
with a special shift-register of length n which 
stores its position code N, where n is the size 
of the code (Fig. 1). Suppose that in the given 
network structure to be assumed, node N should be 


network 
node WN 


network J } 0 0 
node N* 


’ to SEL channel 


FIGURE 1 


O111 


Shift-Register with Variable Bias B = 


succeeded by node N* via PE-PE*, PE-ME*, ME-ME*, 
or ME-PE* data path. Then for each type of .com- 
munication between N and N*, node N generates 
position code N* using a left-shifted shift- 
register that generates N* as follows: 


N* = 1IN]@B (1) 


where 1[N] is one-bit shift of N to the left and 
B is an n-bit reconfiguration constant brought 
with the reconfiguration instruction to all net- 
work nodes that are requested for reconfigura- © 
tion. Reconfiguration constant B will be called 
bias and the shift-register of Fig. 1 is called 
a shigt-register with variable bias (SRVB). 

In Fig. 1, N = 1101 = 13, B = O11] = 7. 
This gives N* = 1[£1101] ® 0111 = 1011 @ 0111 = 
1100. Therefore network node N,. generates posi- 


tion code N* = 1100 = 12 of its successor in the 
given network structure. This code will then 
activate a given data path between nodes Ny3 and 
NF 5 

Ne The gate FBG in Fig. 1 is called a feedback 
gate. Introduction of the FBG gate allows a 
Shift-register, SRVB, to.perform two types of 
shifts: a) circular TEN]; when feedback input 


FI = 13 and b) non-circular TIN],; when FI = 0. 

As will be shown later, if FI = 1, concur- 
rent shift-registers of network nodes generate 
rings; if FI = 0, they generate trees, where the 
meaning of FI is brought to each node with the 
reconfiguration instruction. 

However, different network structures depend 
not only on the value of bias B, and feedback 
input FI, but also on the type of the SRVB acti- 
vated in each node. 

To this end SRVB can be single and composrte. 
A stngle SRVB has a unique feedback gate FBG, 
which connects its MSB with LSB. A composcte 
SRVB is formed from k (k > 1) single shift-regis- 
ters each having a unique feedback gate, FBG.. 


Feedback Path 


FIGURE 


5 


& 


Composite SRVB 


For instance, Fig. 2 shows a composite shift- 
register with three feedback gates, FBG, , FBG. » 
FBG... 

Generally, in a shift-register with variable 
bias, each bit can broadcast its value via one 
of two alternative paths: a) a unique shi¢t-path 
when it is shifted left to the next more signifi- 
cant bit, and b) a unique feedback path; when it 
is sent right to some less significant bit. 

Activation of either a shift or a feedback 
path for each bit can be made by a special recon- 
figuration code RC stored in the reconfiguration 
instruction that performs reconfiguration into a 
given network structure. This instruction also 
brings to each node the same bias B that forms 
position code of the CE* identified with node N* 


that succeeds node N in a given network structure. 


The same bias B received by PE of node N is con- 
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Network Structure Generated by 
Composite SRVB of Fig. 2 
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ceived of as an address of the instruction stored 
in local ME that initiates a subroutine of com- 
munication between node N and N*. 

For instance, if the reconfiguration instruc- 
tion stores bias B = 010111 and reconfigures the 
shift-register, SRVB, of each network node N into 
a composite one shown in Fig. 2, then the network 
Structure formed is shown in Fig. 3. As seen, it 
consists of a 6-root star and a 2-root star. 

For instance, composite shift register parti- 
tions N = 60 = 111100, into bebyb3 = 111; boby = 
ie bo = 0. Bias B = QIOII1 is also partitioned 
into BB AB, = 010; BoB, = 11; and By = 43 
fore the composite shift register generates the 


following successor N* of node N,,~: a aya. = 


There- 


ALT], © 010 = 111@ 010 = 101; aoa, = ILT0], @ 


11 = 00@ 11 = 11; ay = 10], © 1 =O0Q@1=1, 
giving N* = 101111 = 47. 

Similarly, one can obtain any other single 
successor N* of the given node N. As follows, 
reconfiguration into the structure of Fig. 3 
1s performed during the time of one 1-bit shift 
and mod 2 addition executed concurrently by all 
the network nodes that receive the same bias B = 
O10111 and the same reconfiguration code RC that 
reconfigures each SRVB into the composite register 
Shown in Fig. 2. 


4.3. Contribution to the Ongoing Research 

The contribution of this paper to current 
State-of-the-art on network reconfiguration is 
two-fold: 1) It devises original, simple, and 
elegant techniques on network reconfiguration into 
the structures that proved to be convenient for a 
large class of computational and control algo- 
rithms. The time for such reconfigurations 
approaches the theoretically minimal boundary. 
2) It further expands a shift-register theory 
described in [13-17] as follows. 

In the literature the shift-register studied 
is shown in Fig. 4. Here each circle marked with 
B. means connection if B. = | and disconnection 


if B. = 0. Thus B = (Boy: ae By) is conceived 
of as the same bias as was introduced above for 


the shift-register SRVB. The difference between 
these two registers is: Fig. 4 shows a Linear 


Current 
State N 


Next 


State N* 


an-2 a9 


FIGURE 4 


Linear Shift-Register 


Shift-negtster which broadcasts to each mod 2 
adder the meaning of its MSB provided B. =]. 
the linear shift-register each next state N* 
generated can be obtained via matrix multiplica- 
tion N* = NeA where N is a current state stored 
in bits bi), by_os «++» bys and A is the canont- 
cak shift-negister matrix given below: 


In 


Bnet nee geste? Pa. EE BG 

i, 0 © wa © 

0° 1 0 0 0 0 

0 0 1 0 0 0 
A = 

0 0 0 ws. 1 0 0 

2 OO aw @ FT 0 


For instance, if bias B = 101] and the cur- 
rent state N = 1100, then the next state N* to be 
generated by a linear shift-register is N* = 1100> 

1011 
1000 
0100 
0010 
by the SRVB storing the same N and B (assume that 
gate FBG is set since for linear shift-register 
its MSB is also fed back to the LSB): N* = 
1[1100], © 1011 = 1001 ® 1011 = 0010. As follows 


linear shift-register and SRVB generate different 
next states for the same current state N and bias 
B inasmuch as in SRVB each mod 2 adder receives 
bit B; rather than MSB of the register if B. = | 


as is the case for the linear shift-register. As 
a result, different structures of trees, stars, 
and rings are generated by these two types of 
shift-registers. In particular, a fundamental 
property of a linear shift register is that it 
always generates next state N* = 0000 if a cur- 
rent state N = 0000, i.e., 0 always generates a 
cycle of period 1 since N* = 0*A = 0. On the 
other hand, SRVB maps O onto bias B, i.e., if N 
= 0, it is succeeded by N* = B. This means that 
if bias B # 0, then 0 is a node of a network 
structure other than cycle of period 1. 

For the linear shift-register this is funda- 
mentally impossible. For instance, for the 
Single circular SRVB receiving B = Olll, 0 
belongs to the following ring of period 8: 
{0,7,9,4,15,8,6,11}. If this shift-register 
stores N = 0101, it generates another ring of 


= 0011. Find the next state N* generated 
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period 8: {5,13,12,14,10,2,3,1}. Indeed, if 
N= 5, it is succeeded by N* = 1[0101]@ 0111 = 
1010 @ 0111 = 1101 = 13, etc. This network struc- 
ture cannot be obtained with 4-bit linear shift 
registers no matter what bias B is selected, since 
linear shift-registers always map 0 onto 0. Thus, 
the remaining 15 nodes cannot be formed into 2 
rings of period 8 each since this will require 16 
nodes. 

Hence, the network structures generated by 
SRVB and linear shift-registers are not equiva- 
lent. Furthermore, a fundamental drawback of a 
linear shift-register is that the techniques for 
finding the network structures that can be gener- 


ated are very laborious and complex, since they 


are based on finding the periods of polynomials 
over Galois field [13-16]. The complexity of 
these techniques grow exponentially with an 
increase in n, the number of bits in a shift- 
register. However, for complex multicomputer 
networks having a large number of nodes the size 
n of a code that identifies each node may become 
significant (n = 10 and more). Thus it becomes 
prohibitively difficult to utilize elegant 


results of linear shift-register theory in order 


to tabulate different cycles and trees that may 
be generated in an n-dimensional binary space 
with the use of linear shift registers. As for 
stars, linear shift-registers can generate no 
stars by definition. 

On the other hand, all the network struc- 
tures generated by SRVB (single and composite) 
can be described with very simple formulas that 
can be used by the programmer performing various 
reconfigurations in the multicomputer networks. 

As will be shown in this paper, complexity 
of the techniques remain constant and does not 
depend on n, the size of the position code N. 
Thus, these techniques are applicable to complex 
multicomputer networks, inasmuch as they allow 
obtaining simple and fast reconfiguration 
algorithms and simple descriptions of various 
network structures that can be generated in the 
network. | 

The only area of equivalence among linear 
Shift-register and shift-register with variable 
bias is when bias B = 0. If B = 0, both registers 
generate either the same binary tree with the 
root R = Q, or both are transformed into a citcu- 
kating shift-negister whose structure has been 
extensively studied in the literature [16]. 


9. NETWORK RECONFIGURATION 


If an application program needs a new net- 
work structure for execution of its tasks, it 
contains global or local modification of the 
reconfiguration instruction, RIN, where a global 
modification establishes the same type of data 
exchange for all network transitions, N > N*, 
whereas a local modification of RIN allows dif- 
ferent data exchanges for various network transi- 
tions. RIN can be executed in an array or even 
in a single CE. It stores the following codes: 
1) Code RR of requested resource which determines 
whether or not a requested resource is ready for 
reconfiguration. 2) Reconfiguration code, RC, 
that reconfigures the shift-register, SRVB, of 
each requested network node N, into the type that 


generates the required network structure. 3) The 
bias B, which allows each shift-register, SRVB, 
reconfigured by the RC code to generate position 
code N* that succeeds N in the given network 
Structure. 4) Program user code, NP, that is 
used in the priority analysis, aimed at determin- 
ing the priority of the program to perform net- 
work reconfiguration. 5) For global modification 
of the RIN instruction it stores the code of 
exchange, COE, provided all requested nodes will 
maintain the same type of exchange (PE-ME*, 
PE-PE*, ME-ME*, or ME-PE*). Each CE that 
receives reconfiguration code RC, bias B, and 
code of exchange, COE, performs the following 
Steps. ; 
| Step 1. It sends RC to its shift-register, 
SRVB, to reconfigure it into the type (single or 
composite), that generates the required network 
Structure. The bias B is sent to this SRVB to 
generate the position code of the network node N* 
that succeeds N in this network structure. 

Step 2. The bias B is used as the base 
address of the task that begins execution in the 
network structure. As a rule, bias B stores a 
jump instruction which performs jump to another 
location of the local memory, ME. 

Step 3. For the global modification of the 
RIN instruction, the code of exchange, COE acti- 
vates a needed data path between network nodes N 
and N*, where N* was formed during Step 1. 

Such an organization of RIN allows very fast 
reconfigurations into the network structures 


which proved to be very efficient for computation. 


The time of these reconfigurations approaches the 
absolute minimum due to the following reasons: 

a) Concurrent 1-step reconfiguration algorithms 
in which the entire network reconfiguration is 
made in one step by all network nodes during the 
time of 1-bit shift and mod 2 addition. b) 
Minimal time required to establish each data path 
between two network nodes, N and N*. 


6. TYPES OF SHIFT-REGISTERS 
This section will introduce the techniques 


for describing various types of shift-registers, 
SRVB. 


Feedback from bo 


Feedback from b3 


Feedback from b, 


6.1. Arithmetic Formats 


Each composite SRVB will be described with 
an antthmetic format, AF = [k,; Kies: k J; 
where k. is the size of each site shift-regis- 
ter contained in SRVB. Qbviously n = k, + k, + 
. + k. where n is the size of the SRVB. 


Since each single shift-register of the 
arithmetic format AF may perform either circular 
Shift provided that the feedback input FI = 1 or 
non-circular shift provided FI = 0, the arithme- 
tic format AF may be divided into the following 
categories: a) Circular AF,, when all its single 
shift-registers perform cirdular shifts; b) 
Non-cincukanr AF 9» when all its single shift-regis- 


ters perform non-circular shifts; c) Mixed AF 0° 
when single shift-registers described by it per- 
form circular and non-circular shifts. 

It will be convenient to represent mixed 
AF 49 as a combination of circular and non-circu- 
lar AF, j.e., AF 19 = AF, X AF 9s where AF, 


includes all circular single shift-registers and 
AF, includes all non-circular ones. 
For instance if A, = [3,; 41; Sy 251: then 


10 
A, = [4, 5] and Ay = AB 2s. ees Avo = A x Ay: 


6.2. Reconfiguration Code 


* 3 


Reconfiguration of the SRVB into any given 
arithmetic format will be performed with the re- 
configuration code, RC. RC is stored in the 
reconfiguration instruction and described as 
Fol vows: 

It is (2n-1)-bit code where n is the size of 
each SRVB. It consists of (n-1) 2-bit zones, Zi, 
each including two bits, S; and F., and one |-bit 
zones, Ly including only one bit, Foe Thus RC = 
CLs Z.2? oee 9 Z> Z)): 

Each zone Z. encodes, respectively, feedback 
and shifting paths for the two bits b; and Da 
of the SRVB, where b. is more significant than 
bs] LFid.- &). 


Feedback from bo 


FIGURE 5 


Shift and Feedback Paths in 4-bit SRVB 


For each zone, Z; = (FiS.), the values of F. 
and S; Show what type of path is activated for 
every pair of consecutive bits, b. and b._4- If 
F = 1], bit b. receives circular feedback informa- 
tion; and it receives no shift information from 
the next less significant bit, b; ,. If F = 0, 
bit b, receives either no feedback, or it 1S non- 


circular feedback (for trees and stars) 


| Bit S. = 1 of zone Z, stands for left shift 
from b._, to b. and F = Q stands for no shift 
from bs Therefore, together Fes S. show what 


type of path is activated between b. and b. : 
shift path (S. =], F = 0) or feedback path to 
b. and no Site from b, 242 >. = = 0, r= =0v i. 
Since S. = ] neane that bits b. and b. 
belong to the same shift register and S. = "9 


means that they belong to two different shift re- 
gisters, each S. is sent to activate a new feed- 


back path initiated in bs Likewise, each F 
is sent to the feedback path initiated in b. 
either to maintain it if S. = 1 or block unwanted 
transfer of oF to less significant bits of shift 
register if F = Q. 

Example. In Fig. 6, if 54 = ‘15 DITS bs and 
bs belong to the same shift register and S, = 1 
maintains the feedback path initiated in b.. At 
the same time, S, = 0 blocks b, from initiating its 
Own feedback path. 

If 53 = Q, b3 and b. belong to two shift 
registers. Thus 54 = | initiates a new feedback 
path from b, and > = 0 blocks unwanted transfers 
of b, to other less significant bits, etc. As 


follows, selection of the RC code can be formal- 
ized and described with a very Simple algorithm 
that is not introduced in this paper. 


~| 


7. SINGLE NETWORK STRUCTURES 


The objective of this section is to outline 
the ways for solving the following problem: 

Given bias B and an arithmetic format AF = 
[k,, kos ...> kK]. Find the network structure 


that is generated. 

The solution of this problem will allow a 
programmer to select bias B and reconfiguration 
code RC and obtain all the network structures 
that are needed. 

Before attacking a general case of arbitrary 
AF consider the so-called single network struce- 
tures produced by single shift-registers, i.e., 
those identified by AF = [n]. 

These can be of two types: rings and trees, 
specified by circular and non-circular arithmetic 
formats, respectively. Rings will be described 
first. 


7.1. Single Ring Structures 


A stngke rtng structure, SRS, is a set of 


rings that is generated by single shift-registers. 
available in network nodes. To define SRS means 


to define the following: 


a) A set of periods, SP = {T}, where T is 
the period of a ring generated in the SRS, and 

b) The number D(T) of eu onan the 
same period, T. | 


Therefore, we apa SRS as: SRS = 
{D(T): TeSP}. 
Palisa 


Set of Periods for Single Ring Structure 
The set of periods, SP, is completely speci- 
fied by the bias B: namely, how many ones are in 
B--odd or even. Before introducing this result we 
will make some definitions. 

By the weight, W, of the bias B we mean the 
number of ones it has. We say that bias B is even 
if its weight is an even number and B is odd if 


its weight is an odd number. 
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bias, B. 


Let SD, be a set of divisors for number n 


and SDop = Do, - SD, where (-) is understood as 
a set subtraction. 

For instance, for n = 6, SD, = {6,33,2,1} and 
SD. = SD, - SD, = {6,3,2,1} - {3,1} = 16,2}. As 
follows SD. may be specified only if p is even. 


With this in mind let us introduce one theorem 
that specifies the set of ring periods, SPs Tor 
Single ring structures. 
Given: Circular single arithmetic format 
= [n] and Bias B fed to each shift-register. 
Theorem 1. If bias B is even, then SP = 
SD.3; if bias B is odd, then SP = SD, = SD, 
ean n n (2) 
N Example. For the SRVB in Fig. 6, specified 
with arithmetic format AF, = [3 ], bias B = 001 


is odd, since its weight 1 is an odd. number. 


Thus a set of periods SP SD, , 3 = SD, = = SD, = 
SD. = {6,2}. If the same shift-register is fed 


with an even bias (B = 000, 011, 110, 101) then 
the set of ring periods, SP = SD., = {3,1} (Fig. 7). 


| 


7.1.2. Number of Rings with the Same Period 
As was seen, the set of ring periods, SP, can 
be found very easily. It is either SD, or SD, 


for AF, = [n]. Similar simplicity is associated 
with the formulas that find D(T), the number of 
rings having period 7, where T is a member of SP. 


Theorem 2. Ina single circular shift-regis- 
ter, AF, = [n], fed with an even bias B, for any 


period, T € SP, ! = > T'*D(T') (3) 
TV SD- 
As follows from (3), this formula is recur- 
Sive since for any ring period, T, one can find 
the number, D(T), of rings with period T, only 
after finding D(T') for all divisors T' of T. 
Example. Given AF = [6] fed with an even 


Using Theorem 1, one oe that the set of 
its ring periods is SP = SD, = {6,3,2,1}. Using 
Theorem 2, apply the recursive procedure for i 
ing D(T) for any T « SP. Start with T= 1, 2 


] FI=] 


FIGURE 6 


r 
a. Single SRVB with AF, = [3] and B = 001 


b. Single Ring Structure 
Generated by This SRVB 


D(1) = 2; for T= 2, 2° = 1D 


is a known value. 2¢ = 2 


. 

= 1; for T= 3, 2° = 1*D(1) + 
and = (8-2)/3 = 2; for T= 6, 26 = 
2*D(2) + 3D(3) + 6*D(6) and D(6) = (2°-1+D(1)-2s 
D(2)-3-D(3))/6 = (2©-2-2+1-32)/6 = (64-2-2-6)/6 
=9. Thus we found that the single ring structure, 
‘SRS, generated by this shift-register is: SRS = 
1201310 2)52(3)59(6) }. 

Similar simplicity is associated with find- 
ing the numbers of rings with period T generated 
by shift-registers receiving an odd bias B. 

Since the set of ring periods, SP = SD, = SD, - 
SDs where n is the size of shift-register, then 
we may establish the following Theorem 3. 

. Theorem 3. In a single circular shift-regis- 
ter, AF) = [n], fed with an odd bias B, for any 


ring period T e SD 
én 


gl/e . > Tp) (4) 


This is also a recursive formula since one 
can find D(T) only after finding D(T') for all 
periods, T' ¢€ SD. The recursive process starts 


with D(T) where T = 2° and s 2 1, because for T = 
os oe = SD, - SD s_1 contains only one member 
Pe Hee SD te 

28 


TDC); 

and D(1) 

= (4-2)/2 
D(3) 
2) + 


Example. Given shift-register with circular 
arithmetic format AF, = {6}, fed with an odd bias 


B. Using Theorem 1, one obtains that the set of 


its ring periods SP = SD, = SD, 5 a SD. = 


a.. 
N = 
N* = 0 1 ] 
b. 7 0 5 4 3 
2 6 
FIGURE 7 


Single Ring Structure Generated by the 
SRVB with AF = [3] Receiving B = 101 


{12,6,4,3,2,1} - {6,3,;2,1} = {12,4}. Using 
Theorem 3, one first finds D(T) for T = 4 as 4° 


D(4) = 2° giving D(4) = 15 next 12-D(12) + 4+ 


D(4) = 2© = 64 and D(12) = (2°-4-D(4))/12 = 
(64-4)/12 = 5. Thus, single ring structure, SRS, 
generated by this shift-register is: SRS = {1(4), 
5(12)} 


7.2. Single Tree Structure 


As was indicated above a single shift-regis- 
ter with non-circular arithmetic format generates 
a single-rooted binary tree (Fig. 8). We will 
call a single-rooted binary tree generated by a 
single non-circular SRVB with arithmetic format 
AF. = [n] a single tree structure, STS. 

0 As was shown above, to minimize the time of 
reconfiguration, an adopted direction of succes- 
sion is from the leaves to the root, R, which 
then succeeds itself by forming a cycle of 
period |. 

For tabulation purposes, we will use the 
following symbols for different single tree struc- 
tures: if STS is generated by non-circular shift- 
register with arithmetic format AF g = [nj, then 

> 


STS = [n,1], shows that the tree is single and 
has n levels, and the root R succeeds itself by 
> 


forming a cycle of length 1, 1.e., i 

For instance, the STS of Fig. 8 is described 

> 
as STS = {3, 1}, since this tree is single, it. 
has three levels, and its root, R= 7, forms a 
> 

cycle of length 1, i.e., 1; because 11111], © 001 
= 110@ 001 = 111. 
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Abtract -- In this paper we consider the de- 
sign of a Hierarchical Multiprocessor (HMp) for 
general-purpose applications. The main attribute 
of the HMp is the simplicity of the interconnectim 
network. The HMp consists of clusters of proc- 
essors connected hierarchically for both data 
processing and daba distribution. 


There are two levels of interprocessor 
communications in the HMp, an implementation of 
which is developed on the basis of the monitor 
conecpet. Using queueing network mddels, the per- 
formance falloffs due to shared hardware is also 
analyzed, and the optimum number of processors 
in each cluster is then determined. 


1. INTRODUCTION 


In the past few years multiprocessor architec- 
tures have gained considerable attention due to the 
availability of the powerful but inexpensive micro- 
processors and memories in the computing arena. 

The question that still remains to be answered sat- 
isfactorily is whether the microprocessor can be 
utilized as a building block for large general-pur- 
pose computer systems, thereby achieving a higher 
performance/price ratio as canpared to traditional 
uniprocessor architectures. A survey of existing 
multiprocessor organizations can be found in [1]. 
The unsolved issues associated with multiprocessors 
are also well discussed in [2]. 


The proposed architecture called the hierarch- 
ical multiprocessor (HMp) has been considerably 


influenced by both the Cm* architecture at Carnegie- 
Mellon University and the Hierarchical Multicomputer 


Organization at State University of New York, 
Stony Brook. 


The central idea in Cm* [1,3,4] is the group- 
ing of processors into clusters and the concept of 
a task force [5,12] which is ideal for a cluster 
organization. The main drawback in Cm* however is 
the integration of the I/O units into the system. 
The I/O units are made dependent on individual pro- 
cessors which results in an unstructured operating 
system and gives rise to reliability and utiliza- 
tion problems. This to some extent has been solved 


by the Hierarchical Multicomputer Organization [6,7] 


This work is partially supported by the National 
Institute of Justice, U.S. Department of Justice, 
under Contract J-LEAA-O014-78, and the National 
Science Foundation under Grant EnG-7911347. 
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where the idea of separating the control and data 
moving functions has been suggested. In the pro- 
posed architecture this idea has been extended to 
include the users interface to the system. 


The HMp has been. designed to minimize the 
interconnection complexity of the system and uses 
only a few types of functional units as building 
blocks for the system. The aim of the design 
has been to create a general purpose multiproc- 
essor with no restriction on the types of algo- 
rithms which it can exploit. 


This paper is organized as follows. Section 
2 discusses the HMp architecture of the multi- 
processor in same detail. Section 3 describes 
the structure of the kernels necessary to imple- 
ment monitor primitives [9,10] for synchronizatim 
purposes. Finally Section 4 deals with the per- 
formance falloffs due to shared hardware re- 
sources and analyzes the performance of the sys- 
tem in terms of the processing rate. Conclusion 
follows in Section 5. 


2. ARCHITECTURE 
2.1 Overview 


A multiprocessor should be able to exploit 
the explicit or implicit parallelism given by an 
algorithm. This is possible mly if the number 
of steps in each parallel path is greater than a 
fixed minimum so as to offset the communication 
overhead existing between interacting tasks. 

Thus the extent of exploitable parallelism de- 
pends on the communication overhead between inter- 
acting processes. The hardware interconnection 
which has the lowest associated communication 
overhead is the shared memory concept. The re- 
striction of this approach is that the communica- 
tion overhead increases as the number of proc- 
essors in the system increases. 


To circumvent this problem, a system with 
two levels of communication is developed. At the 
first level of communication the communication 
time is kept to a minimum and independent of the 
total number of processors in the system. At 
the second level of communication the canmunica- 
tion time is sacrificed for extensibility and 
hardware interconnection costs. The processors 
in this architecture are grouped into clusters. 


The significance of this approach becomes 
more evident when we examine the property of 
process locality [2]; which states that inter- 


action within a defined group of processes is 
frequent, whereas interaction between different 
groups is infrequent. If processes are allocated 
to processors such that the processes of the same 
group reside in any single cluster, then the 
communication overhead would Coenen to that of 
a closely oupec? system. 


The HMp consists of two hierarchies, namely 
the processing hierarchy and the data distribu- 
tion hierarchy. The data distribution hierarchy 
handles the file management functions and the 
processing hierarchy handles the processing func- 
tions. To differentiate the processors in the 
processing hierarchy from those of the data 
distribution hierarchy, the former are referred to 
as the P-processors and the latter as the on 
cessors. 


The processing hierarchy consists of pro- 
cessing modules grouped into clusters which are 
then organized in a hierarchical fashion. Associ- 
ated with each cluster of processors in the pro- 
cessing hierarchy is a parent processor which is 
part of the cluster one level higher in the hier- 
archy. Each cluster in the processing hierarchy 
has associated with it a D-processor. The D-proc- 
essors of the system with the secmdary memory from 
the data distribution hierarchy. The cluster 
organization is presented first and then the 
system organization is described in some detail. 


2.2 Cluster Organization 


The cluster consists of processing modules 
which have a sibling relationship to each other 
and they share a common memory by means of a time 
shared common bus (Fig. 1). Conflicts of access 
to the common bus are resolved by the bus arbiter, 
and the handshaking required for gaining control 
of the bus is done by the switch, which is a 
subsystem of each processing module. Each proc- 
essing module in the cluster consists of a proc- 
essor, local memory, a swtich, a DMA interface 
to the D-processor and serial links to its child 
and parent processing modules. 


2.2.1 The Switch. The processor does not 
distinguish between accesses to common memory and 
its local memory. It is the responsiblity of 
the switch to recognize a nonlocal reference and 
initiate the necessary handshaking to perform 
the memory access. To access common memory, the 
switch has to gain control of the common bus by 
handshaking with the arbiter. The switch has 
been given the capability of buffering a single 
data word which has to be read from or written 
into the canmon memory. Also for ease of imple- 
menting process synchronization primitives the 
switch has been given the capability of requesting 
the control of the bus at two levels, depending 
upon the status of the switch (Section 3 will deal 
with this in some detail). This status is explicit- 
ly set by the processor and is alterable only by 
the processor. 


2.c.c The Bus Arbiter. The bus arbiter is 


moderately complex since it can grant control of 
the bus at two levels and there are certain rules 
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it has to follow in order to preserve the integ- 
rity of the interprocess synchronization primi 
tives (This will be discussed later). The arbiter 
provides a round robin service to requesting 
processors to ensure that all requests will be 
honored in due time. Each of the switches has 
two individual request lines to the arbiter for 
requesting control of the bus at the two levels, 
and correspondingly there are two grant lines 

to each switch, 


2.2.3 The Control Links. Since the cotrol 


links are serial in nature, we need additional 


processing at both ends of the link for buffer- 
ing a message, generating interrupts and setting 
up flags at the completion of a message transfer. 
A parent processor can interrupt its child 
processor through the serial control link at two 
levels: one level is maskable and the other is 
nonmaskable. An interrupt at any of the two 
levels will cause the child to execute a message 
receiving routine which is a part of the kernel 
software. In normal operation a parent interrupts 
its child at the maskable level. This implies 
that if the child is inside the kernel, the inter- 
rupt will remain pending until the child exits 
from the kernel. But if the parent has reason 

to believe that a malfunction has occurred, it 
interrupts at the nonmaskable level. The child 

on the other hand can interrupt its parent through 
the serial control link mly at the maskable level. 
This ensures that the parent can still function 
with a faulty child processor. 


2.2.4 The DMA Interface. The DMA interface 
transfers blocks of code/data to and from the 
local memory of the D-processor associated with 
the cluster and the local memories of the process- 
ing modules. The DMA interface is also used in 
setting up code/data in the common memory of a 
cluster. To start a block transfer, the parent 
processor of the cluster gives the order to the 
D-processor including the identity of the file, 
processor number, starting address, the length 
of the block and the direction of transfer. The 
D-processor then sets the address registers 
and the word count register of the DMA interface 
and initiates the transfer. On completing the 
transfer the DMA interface informs the D—processor 
which in turn informs the parent processor of 
the cluster. 


2.3 Data Distribution Hierarchy 


For each of the clusters in the processing 
hierarchy there is an associated D-processor 
which handles the transfer of code/data into or 
out of the cluster. Since most of the processing 
is done at the bottom level of the processing 
hierarchy, most of the file transfers in the sys- 
tem will be handled by the associated leaf D- 
processors. Thus we need high capability data 
links between the secondary storage units and 
the bottom level D-Processors of the data dis- 
tribution hierarchy. To perform the file manage- 
ment functions of the system, the D-processors 
need to exchange short control messages between 
themselves. The D-processors are interconnected 
hierarchically by means of serial links and since | 


at times there will be file transfers on these 
links, a packet switching communication system 
has to be implemented. 


All the human interfaces to the system are 
connected to the data distribution hierarchy and 
so it acts as the source of all tasks which need 
processing power from the processing hierarchy. 
New processes enter the processing hierarchy via 
the serial control links interconnecting the two 
hierarchies and the results enter the data dis- 
tribution hierarchy in the same way. The D-proce 
essors act as canmand message interpreters in the 
same sense as the 'shell' of the UNIX system [11] 
and create processes which execute the command 
message in the processing hierarchy. 


2.4 Root Cluster Organization 


The two hierarchies of the system namely the 
processing hierarchy and data distribution hier- 
archy are merged at the top by a root cluster 
whose organization is slightly different fran 
that of the other clusters in the system (Fig. 2) 
The root cluster consists of both P-processors 
and D—-processors sharing a common memory. The 
processing hierarchy is attached to the P-proc- 
essors of the root cluster and the data distribu- 
tion hierarchy is connected to the D-processors 
of the root cluster. 


Tasks of the operating system executing in 
the root cluster can oversee both the processing 
and the data distribution hierarchies. Typically 
these tasks would consist of cooperating parallel 
processes and since the processors in the root are 
tightly coupled, it leads to an efficient imple- 
mentation. 


3. SYNCHRONIZATION AND INTERPROCESS 
COMMUNI CATION 


For any multiprocessor architecture it is 
essential to have an efficient implementation of 
the synchronization and interprocess communication 
primitives. Microprocessor architectures being 
introduced at present have capabilities to support 
two execution modes, features for memory pro- 
tection and hardware support for task switching. 
These hardware supports simplify the implementa- 
tion of efficient primitives. 


3.1 General Approaches 


Synchronization and interprocess cammuni ca- 


tion can be implemented by using any of the follow- 


ing mechanisms: semaphores, mailboxes, message 
queues or monitors. Each of these mechanisms is 
logically equivalent to the other. 


From a software point of view, monitors [5] 
are an ideal solution since they help in specify- 
ing the precedence relationships in a structured 
fashion. Monitors consist of shared data and 
procedures which operate on the shared data. A 
process can operate on the shared data only 
through the procedures of the monitor and not 
directly. Since only one process can be inside 
the monitor at any time operations on the shared 


143 


data are mutually exclusive. The primitives re- 
quired to support monitors are: entering a moni- 
tor, exiting amonitor, signalling a condition 
and waiting for a condition [10]. 


Since we have interprocess communications at 
two levels: 1) between processors in the same 
cluster and 2) processors in different clusters, 
we will first discuss the implementation of the 
primitives at the cluster level and then at the 
system level. 


3.2 Synchronization at the Cluster Level 


To be as general as possible we assume that 
there can be more than one process assigned to a 
Single processor at any time and that the imple- 
mentation should handle both static and dynamic 
creation of tasks. 


To limit the loading onthe central resources 
of the cluster (i.e. the common memory, the 
common bus and the parent processor), we decided 
to define two kernels; the processor kernel (call- 
ed the P-kernel) and the cluster kernel (called 
the C-kernel). The P-kernel resides in each 
processor and manages the processes residing in 
that processor. The C-kernel handles the monitors 
of all processes residing in that cluster and is 
located in the common memory of that cluster. 
Since the kernels handle the system queues, they 
themselves should not be interrupted to assure 
that no race conditions develop. This is easy 
to implement for the P-kernel since on entering 
the kernel it can disable all interrupts (includ- 
ing the interrupts from the parent processor). 
But mutual exclusiveness for the C-kernel has to 
be implemented by using additional hardware. 
This mutual exclusiveness is taken care of by the 
arbiter and is discussed later. 


The C-kernel provides mutual exclusiveness 
of the monitors by associating with each monitor 
a flag which récords whether the monitor is busy 
or not. Thus the C-kernel provides a means of 
having more than one monitor busy at the same 
time. The C-kernel maintains the queues for 
processes waiting to enter a monitor and queues 
for each condition. The P-kernel queues contain 
the full status of the processes necessary to re- 
start the processes whereas the queues in the C- 
kernel contain only minimal information to 
identify the processes. This is to ensure that 
the loading on the central resources is as minimum 
as possible. 


The tasks running in the processors of the 
cluster are in the user mode, and execution of 
any of the synchronization primitives causes a 
trap to the P-kernel of the processor. The 
P-kernel saves the status of the user process and 
then tries to enter the C~kernel and waits if 
busy until it is free. This does not load the 
central resources but only idles the processor. 
Once the C-kernel is free, the P-kernel enters 
it and performs the operations corresponding to 
the desired primitive operation. It should be 
again stressed that the C-kernel can be entered 
only through the P-kernel and not directly by the 


user processes. The operations done after enter-— 
ing the C~kernel for the case of me primitive is 
discussed and the rest are similar. 


Exiting from a monitor: The C-kernel checks 
‘the queue associated with the monitor and if there 
is no process waiting to use the monitor, it resets 
the monitor flag and exists to the p-kernel. If 
hwoever, there is a process waiting, it sends | 
the identification of the waiting process to the 
parent processor (to be woken up) and then exits 
without resetting the monitor flag. The P-kernel 
then passes control back to the user process. 


When the parent processor receives the message 
for waking up a process, it: interrupts the processor 
which has that process in its wait. queue. and thus 
we have a "positive wakeup of the process" [8]. 

The parent processor does not require to keep 

track of where the process is residing since the 
message from the C-kernel cmtains both the 
identification of the process and the physical pro- 
cessor in which it is residing. 


For the case of dynamic creatim of processes, 
the technique used is’ quite similar. ._Executim 
of a FORK statement by a user process causes a 
trap into the P-kernel and the P-kernel then re- 
quests the parent processor to create the re- 
quired number of processes. It is understood that 
any data to be shared has initially been stored 
in the cammom memory of the cluster when the task 
itself was allocated. Once the Parent processor 
acknowledges the message, the P-kernel gives cm- 
trol back to the user process. 

3.2.1 Functions of the Arbiter. Function 
of the arbiter are: 


1. to give mastership of the bus to a requesting 
processor, and 


2. to keep track of the cmdition of the C-kernel 
(i.e busy or not) and thus provide mutual 
exclusion of the C-kernel. 


Each processor can request use of the bus at two 
levels, one for using the C-Kerne1 and the other 
for using code /data outside the C-kernel (i.e. 
the monitor procedures). This is implemented by 
using an independent set of two request and two 
grant lines for each processor. | 


If the processor wants to enter the C-kernel, 
it sets a status bit in the switch of the pro-. 
cessor module. The switch then asserts the C- 
request. line and if the C-kernel is not in use, 
the arbiter asserts the C-grant line. The switch 
then sets a flag indicating to the processor that 
it can now proceed to use the C-kernel. Then for 
each access to the canmon memory the switch 
asserts the B-request and performs the memory ac- 
cess after the arbiter asserts the B-grant line. 
Once the processor exits from the C-kernel, it 
resets the status bit in the switch which causes 
the switch to deassert the C-request line. If the 
processor wants to access code/data which is out- 
side the C-kernel, then the processor does not 
set the status bit in the switch. For each access 
to the comma memory the switch only asserts the 
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‘other processor in the C-kernel. 


Berequest line, 


The arbiter provides mututal exclusion of 
the C-kernel by asserting C-grant to only one 
of the processors which has its C-request line 
asserted and ignores all other requests for the - 
C-kernel until the corresponding processor exits 
from the C-kernel. The arbiter can give master- 
ship of the bus to a processor with only its B- 
request line asserted even though there is an- 
This does 
not -cfeate race conditions but does improve the 
utilization and availability of the time shared 
bus. Thus the arbiter provides a round robin ser- 
vice for the use of the bus (by asserting B-grant) 
and a round robin service for the use of the C- 
«kernel (by asserting C-grant). 


3.3 Synchronization at the System Level 


We have so far discussed the implementation 
of the inter-process synchronization primitives | 
at the cluster level. There can be two approaches 
for implementing these primitives at the system 
level. One approach would be to have processes 
residing in different clusters communicate via 
messages. This involves the complexity of having 
two types of communication primitives, one at the 
cluster level and the other at the system level. 
It suffers from the fact that the architecture 
is not transparent to the systems programmer. 


The second approach is to implement the. syn- 
chronizatim primitives .at the system level by us- 
ing the mmitor concept. This provides transparen- 
cy end makes it easier-for the system programmer to 
implement the system software. Since monitor pro- 
cedures access only data local to the monitor, 


all interactions between the calling process and 
the monitor procedure is made via arguments. Thus 
execution of a monitor procedure whose physical 
location is in another cluster can be implemented 
via messages. The monitor procedures will physi«.: 
cally reside at a common ancestor cluster of the 
two clusters in which the cammunicating processes 


are present. 


The basic kernel of the operating system which 
handles the processes and the inter-—process cammuni- 
cation is described below. This kernel code is 
replicated in all the P-processors of the HMp. 


“The basic kernel consists of essentially two levels. 


The first level cosists of the P-kernel and the 
C-kernel. The second level consists of the mes- 
sage handler which implements the primitives neces- 
sary for a process to switch processors. A process 
can execute a monitor procedure whose physical 
location is in another cluster by migrating to that 
cluster. The message handler can create, destroy 
or wake up processes residing in the processor. 
This is necessary for implementing the monitor 
primitives and also serves to implement the concept 
of coscheduling the task force [12]. 


Above this basic kernel, a distributed operat-. 


ing system such as Medusa [12] can be implemented. 
Medusa consists of a set of disjoint utilities 
(each of which is a task force) which communicate 


via messages using a structure called pipes [11]. 
This structure can be implemented by using the 
monitor primitives made available by the message 
handler. 


4, PERFORMANCE ANALYSIS 


Since the present organization consists of 
two levels; we first determine the performance 
of a single cluster treating it as a single 
unit. Using these results we evaluate the per- 
formance of the entire system. In this analysis, 
performance refers only to the throughput of the 
system and not to any other factors. 


4.1 Performance of a Single Cluster 


The resources which are shared by the pro- 
cessors of a single cluster are the time shared 
common bus, the canmon memory, the D-processor 
and the parent processor. Interference in shar- 
ing these resources results in a decrease in the 
performance of each processor. 


The reason for analyzing the cluster is to 
determine the optimum number of processors for 
a cluster and to find the limiting value of 
performance due entirely to hardware constraints. 
We are at present not comsidering the performance 
falloffs due to software precedence relationships 
which do affect the final figure of performance. 

The performance of a cluster of processors is 
being evaluated by using queueing network models. 
The first queueing network models the perform- 
ance falloffs due to common memory interference. 
The second queueing network models the perform- 
ance falloffs due to the parent processor and the 
D-processor of the cluster. 


4.1.1 Common Memory Interference. Let us 


define the memory cycle time (Mc) as the time 
taken to read or write a single word into common 
memory once the switch has mastership of the time 
shared common bus. Let us also define the access 
interval time (Ai) as the time interval between 
two consecutive accesses to memory by a pro- 
cessor. The accesses can be either for code or 
data. Even though there is a variation in the 
access interval times we assume for simplicity 
that it has a constant value [13]. 


Let us further denote the integer value 
[Ai /Mc] by m. The greater the value of m, the 
less the interference due to the shared re- 
source and thus greater is the performance of the 
processors in the cluster. If the common memory 
is implemented in bipolar technology and the pro- 
cessors in MOS technology then the value of m is 
usually in the range 3 to 10 and thus can be used 
as a design parameter. Using bipolar memories 
for the common memory is reasonable since the 
memory requirements for shared informatio is 
small. 


To analyze the interference in accessing the 
common memory, we should have an understanding 
of the nature of the stochastic process which 
describes the accesses to common memory by each 
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processor. Reviewing the use of common memory we 
find that the common memory is used only for moni- 
tor procedures, and their associated data and 
control mechanisms. When a processor starts exe- 
cuting a monitor procedure, all memory accesses 
will be to the cammon memory since both code and 
data will reside in the common memory. Thus 
successive accesses to common memory by the same 
processor cannot be modeled as independent random 
variables. 


If a processor executes any of the monitor 
primitives, it begins executing the code of the 
C-kernel and then, depending upon the type of 
monitor primitives desired and the state of the 
desired monitor, one of the following actions 
takes place. 


1. Processor starts to execute the monitor pro- 
cedure. 


2. It wakes up a process residing in another 
processor to execute the monitor procedure. 


3. It waits for another process to signal it and 
at that time it continues to execute the 
monitor procedure. 


4. It does not execute the monitor procedure nor 
does it wake up another process to execute 
the monitor procedure. 


Examining the above cases we find that for 
the first two cases the monitor procedure is 
executed either by the same processor or by an- 
other immediately following the execution of the 
C-Kernel. In the last two cases the monitor 
procedures are not executed, and the next time 
the processor accesses the common memory it would 
execute the C-kernel. Once a monitor procedure 
is being executed, the processor has to execute 
one of the monitor primitives to exit from the 
monitor. The above four cases can be condensed to 
the following two cases. 


1. The processor executes the C-kernel, then the 
monitor procedure and finally the C-kernel; 
each one immediately after the other. 


2. The processor executes the C-kernel and then 
starts executing code from its local memory 
and then the C-kernel when it cames across a 
monitor primitive. 


Both of these cases can be represented by one uni- 
fied model which is as follows: the processor 
first executes the C-kernel and then a monitor 
procedure and then code from its local memory and 
then finally the C-kernel again. 


4.1.2 Closed Queueing Network Model. We 


can model the memory contention problem as a closed 
queueing network model with appropriate service 
times and scheduling policies. The resource being 
shared is the common memory and the service time 
it provides can be measured in terms of the number 
of common memory accesses. 


The number of cammon memory accesses needed to 
execute a portion of a monitor procedure sand 
wiched between two consecutive monitor primitives 


can be treated as a random variable with an ex-— 
ponential distribution. The number of local 
memory accesses between two monitor calls can 
also be treated as a random variable with an 
exponential distribution. The nunber of cammon 
memory accesses needed to execute the monitor 
primitive by means of the C-kernel is assumed 
to have an exponential distribution. Even 
though the actual distributiong might not corre- 
spond to our assumptions, queueing models are 
generally robust and do give good results, 


The queueing network consists of three nodes 
two of which consist of parallel servers and the 
third a single server (Fig. 4). The first node 
consists of n servers where n is the number 
of processors in the cluster. The service time 
for these servers corresponds to the distribution 
of the number of local memory accesses between 
two consecutive monitor calls. Node lis of 
type-D [15] since the customers are delayed inde- 
pendently of other customers at this service 
center. 


The common memory can be treated as m vir- 
tual parallel servers since effectively there 
can be m common memory accesses in time period 
Ai. Also note that we cannot give more than one 
common memory access to a processor in a given 
time period Ai. (Fig. 5). Of the m virtual 
servers of the common memory one server services 
the C-kernel queue which is Node 2 of the queue- 
ing network. The rest of the m- 1 virtual servers 
service the monitor queue and form Node 3 of the 
queueing network. 


In the actual system however, the server 
servicing the C-kernel queue would service cus- 
tomers in the monitor queue if there are no 
customers in the C-kernel queue. Therefore the 
performance characteristics obtained by this 
queueing network model gives the lower bound of 
the actual performance. The upper bound of the 
performance can be easily obtained by having an 
additional parallel server at Node 3. 


The scheduling policy used for Node 2 and 
Node 3 of the queueing network is FCFS. In the 
actual system the type of scheduling used to 
service the monitor queve is round robin. As 
we are only interested in the mean values of the 
waiting time and the mean queue lengths, we can 
assume an FCFS service mechanism. As long as 
the scheduling is independent of the service re- 
quirements of a customer, the mean values do not 
change [1]. 


The analytical solution of the queueing net- 
work was carried out by the recursive algorithm 
in [15]. The results shown in Fig. 6 correspond 
to mean value service times indicated below 
(values normalized by the: mean number of local 
memory accesses needed to execute a block of 
local code sandwiched between two consecutive © 
monitor calls): 


1. Mean number of common memory accesses needed 
for executing the C-kernel: Case 1: 2.5%; 
Case 2: 5%. 
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“processor. 


2, Mean number. of common memory aecesses 
needed for executing monitor code . 
sandwiched between two monitor primitives: 
Case 1: 10%; Case 2: 20%. 


The results give the lower and upper bounds 
of the performance of the cluster with common 
memory interference for m= 3 andm= }h. 


4.1.3. Parent Processor Interference. To 
evaluate the type and frequency of demands placed 
on the parent processor, let us consider its func- 
tions. The parent processor basically consists 
of a message handler and other user or system 
processes. The message handler handles all in- 
coming messages from the child processors, the 
D-processor and the parent of the processor it- 
self. There are three types of messages which 
can occur and their description follows. 


The first type is a synchronization request 
between processes residing in the same cluster. 
The amount of processing time needed to process 
this type of message is small but their frequency 
of occurrence is high. The parent processor 
should be very responsive to these requests since 
any delay would entail a decrease in the perform- 
ance of the cluster. 


The second type of message is a request for 
execution of a monitor procedure residing in the 
parent processor cluster. This entails the 
creation of a process by the message handler by 
inserting its description in the ready queue of 
the P-kernel. The created process then needs 
processing time to execute the monitor. Then the 
message handler has to reconvert the process into 
a message form and send it back to the child 
The frequency of occurrence of these 
type of messages is small but the processing time 
needed is higher in relation to the messages of 
the first type. 


The third type of message is a request for 
the transfer of code/data into or out of the lo- 
cal memories of the child processor. This message 
should be redirected to the D-processor and once 
the transfer is over the reply from the D- 
processor should be sent to the child processor. 
The frequency of occurrence of these type of 
messages is low and the processing time needed 
is also low. 


We can now treat the 


4.1.4 Queueing Model. 


system as a closed queveing network model with the 


parent processor and the D-processor as servers 
with the processors in the cluster and the parent 
of the processor itself as the customers (Fig. 7). 
The processors are assumed to have a think time 
which is exponentially distributed. After each 
think period a processor sends a message to the 
parent processor which acts as the server. The 
message can be of any type and it is placed in 
the message queue. The service time requirements 
for customers in the message queue are assumed to 
be exponential. This service time includes the 
time taken to read the message from the hardware 
buffer; perform the synchronization or create a 
new process by entering it in the run queue of. 
the kernel. 


The customers coming out of the first server 
Gantake three paths where each path has an assign- 
ed probability. The three cases are as follows: 


1. If the customer needs no further processing, 
then it returns to the processor from which 
it originated (P21). 


2. If further processing is required, then the 
customer is put back in the message queue 
which is serviced by the parent processor 
(P22) 


3. If further processing is required from the 
data processor, then it is put in the data 
queue. After it receives service from the 
data processor, it is put back in the 
message queue (P23). 


Both the queues in the model namely the 
message queue, and the data queue are served in a 
FCFS discipline. Synchronization messages are 
being given higher priority in the queueing 
model since additional processing needed by a 
message is being postponed until the backlog 
of messages have been serviced. © 


Pee with the average service time for 
customers in the message queue determines the 
total service time requirements for messages 
which need execution of monitor procedures. The 
service time for customers in the data queue is 
the total time the system takes to transfer the 
code/data to or from the primary memory. This 
service time is also assumed to have an exponen- 
tial distribution. 


The analytical solution of the queueing net- 
work was again obtained by the recursive algo- 
rithm in [15]. The results shown in Fig. 8. 
correspond to the following mean value service 
times (the values are normalized by the service 
time for executing code residing in the cluster, 
sandwiched between two consecutive calls to the 
parent processor): 


1. Mean service time taken by the parent process@r 


to perform a synchronization request: 2.5% 


2. Mean service time for the D-processor to 
transfer code/data in and out of the cluster 
processor's private memory: 20% 


The routing probabilities for the closed queueing 
network was taken as follows: 


1. Case 1: Pel=0.8, P22=0.1, P23=0.1 


2, Case 2: P2l=0.7, P22=0.2, P23=0.1 


4,2 The System Performance 


From the analysis so far carried out we have 
to arrive at the figure for the optimum number of 
processors in each cluster. Since m, the figure 
of merit of the common memory can be varied within 
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a reasonable range; it can be varied such that 
the parent processor becomes the critical shared 


resource of the cluster. 


Assuming that we desire at least 90% of the 
ideal performance (i.e. when there is no inter- 
ference), we come up with the figure of 15 pro- 
cessors from the performance curves of the parent 
processor (Case 1). Since the parent processor 
is also a resource for the grandparent of the 
cluster, the optimum number of processors for 
the cluster would be equal to 1}. 


From this optimum number of processors and 
the curves for the performance falloffs due to 
the common memory interference we can determine 
the desired value of m, such that the critical 
resource is still the parent processor. Form 
equal to 3 and the number of processors equal to 
14 we have the performance due to common memory 
interference as 97% of the ideal case for Case l. 


The performance falloff when both the shared 
resources are present can be taken as the sum of 
the individual performance falloffs as a first 
approximation. Therefore the performance of the 
cluster with both the shared resources present 
will be 87% (=100-(100-90)-(100-97) of the ideal 
performance. Since the number of processors in 
the cluster is 14, the net cluster performance 


will be equivalent to that of 12 processors 
(14*87%) . 


Assuming that most of the actual processing 
takes place in the leaf clusters, the total sys- 
tem performance can be written as the product 
of the cluster performance (net performance) and 
the number of leaf clusters in the processing 
hierarchy of the computer system. The above 
analysis assumes that the P-processors at the 
higher levels are busy synchronizing and perform- 
ing other communication tasks. 


2+ CONCLUSION 


We have presented the architecture of the 
HMp, the synchronization primitives and finally 
the performance of the system based on these 
primitives. Our future work will concentrate on 
the interesting aspects brought up by this archi- 
tecture, some of which are given below. 


The HMp has an upper bound on the number of 
levels it can have. This depends on the higher 
level processors becoming the bottlenecks in the 
system. This is thus related to the number and 
locality of the inter-—cluster messages which 
further depends on the operating system structure. 


The effects of software precedence has to be 
introduced into our queueing models for determin- 
ing the actual performance falloffs. This will 
be useful in determining the effects of both soft- 
ware and hardware constraints in the system. We 
know by intuition that the figure for the optimum 
number of processors in each cluster will increase, 
when these effects are taken into account. 


The files in the secondary memory should be 
distributed such that the time to access them 
from any point in the processing hierarchy is a 
minimum. The modelling of such a system will in- 
volve the actual hardware being used and the band- 
width of the interconnections in the data distri- 
bution hierarchy. 


[13] 
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Summary 


A highly distributed data-flow processor 
based on block-driven principles is described. 
Being block-driven, data-flow programs can be 
executed in functional blocks. As a result, data 
transfers can be effectively separated into 
different levels of communication paths. Through 
the use of a structured computer architecture and 
a hierarchical data-transfer mechanism within the 
tree network, this data-flow processor provides 
programmable communication paths for fast data 
transfer while at the same time achieving very 
high levels of concurrency. 


Introduction 


In a data-driven computing system, the in- 
structions of a data-flow program are normally 
stored in the instruction memory cells; an in- 
struction cell will be fired whenever the required 
data are available [1.2]. Based on a variety of 
forms of parallel routing and parallel computation, 
a large number of such instructions can be exe- 
cuted simultaneously. However, due to data depen- 
dencies, this data-driven approach requires streams 
of data to be routed back to the instruction memory 
cells. These data then trigger newly addressed 
instruction cells. Based on the trend that data 
transfers are becoming more and more costly when 
compared to the cost of a unit of computing power, 
there is a critical need for data-flow structures 
having efficient data transfer paths [3,4]. Con- 
sidering the constraint of data dependencies to 
data-flow instructions, this constraint varies from 
one instruction to another. Some instructions re- 
quire one data item from the previous execution and 
some require two. An instruction with one depen- 
dent data item can be executed immediately after 
the previous operation is completed. However, the 
execution of an instruction with two dependent data 
items has to wait until both data items are ready. 
This results in a different degree of efficiency 
for transferring these two types of data. A num- 
ber of data-flow machines, which have been pro- 
posed recently or which have already been devel- 
oped, have paid little attention to this problem. 
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In these machines, data items are transferred on 
a fixed-length data path basis. As a result, the 
problem of intercell communication overhead has 
limited their potential to only a handful of 
applications [1,2,5]. 


In an effort to overcome the communication 
overhead problem, a block-driven approach is ex- 
plored. By contrast to the "data-driven", the 
"block-driven" can be best described as the group 
firing of instructions which belong to a group of 
composite computations. The result of this group 
firing is that data paths can be effectively sepa- 
rated. Data transfers among the already executed 
instructions and the succeeding instruction cells 
within the same group will have shorter paths 
compared to those outside the group. 


The purpose of this paper is three-fold: 
First, we describe the block-driven principle and 
discuss its potential advantages: second, we 
introduce an hourglass computing model to facili- 
tate different levels of data transfer, and by 
this model, develop a block-driven data-flow 
computer architecture; and finally, we present a 
push-pull data-transfer mechanism. 


Block-Driven Data-Flow Principle 


The flow of data and control in data-flow 
programs represent fundamentally a random motion 
phenomenon. And the requirement for transferring 
this tremendous number of data links and control 
signals has made the design of parallel processors 
extremely complex. Although a large number of 
algorithms have been designed to make applicative 
programs more suitable for parallel computations, 
they have been proved to be efficient only when 
they are executed in specialized array processors 
[6,7,8]. Like the structured programming techni- 
que being widely applied to the software design, 
the idea of structural, compound function compu- 
tations has been gaining a great deal of support 
recently [9,10]. The block-driven approach is 
developed here to exploit as much of the structure 
of data-flow programs and machines as possible. 
In what follows, we will describe this approach. 
Two important computational steps in data-flow 
programs are the branch computation and the joint 
computation. Branch computations involve a se- 
quence of chain computations with the constraint 


that each succeeding computation requires one data 
item from its previous computation. In other 
words, a computation branch is represented by a 
mathematical expression of the form 

fa a aah 
in which the f£,'s denote the elementary arithmetic 
function (+,-,%,/) and the a,'s are data. A joint 
computation is defined as a pair of branch compu-— 
tations which are linked together by a computation 
node at their ends. A data-flow graph containing 
two computation branches at a joint is shown in 
Figure 1. By such linkage, we can form a number 
of joint computations into a cluster of functional 
cells or a computational block. Each functional 
cell is made up of a group of composite instruc- 
tions and a number of data operands. Data items 
which are used to initiate the firing of a func- - 
tional cell are called global data. 


1, (+ °£, (ay sf, (a, 2a) °°) 


In a block-driven computer system, the firing 
of an instruction cell in the data-flow program is 
subject to the firing of a functional cell with 


which this instruction is associated; and the fir- 


ing of a functional cell is subject to the group 
firing of a computational block. Through this 
process, a large number of independent functions 
can be executed in parallel. Further, complex 
algorithms can be easily decomposed and EnereDy 
effectively executed. 


Joint Computation 


One important issue of having data-flow pro- 
grams executed by the block-driven process is con- 
cerned with the manner in which a computation is 
executed locally. To deal with this, we propose a 
simple and effective processing pair for branch 
computations. The processing pair consists of a 


pair of local supervisors and a pair of FLP comput- 


ing modules connected in a ring configuration (see 
Figure 2). Data-flow instructions on a pair of 
branches are executed on an interleaved basis. An 
example of this interleaved computation is illus- 
trated in Figure 3. The advantages of this ap- 
proach are: first, the succeeding instructions 

can be driven by local data with an address field 
of minimum bits; second, reliability can be greatly 
improved by connecting the processing elements 

in this manner. 


Hourglass Data-Flow Computing Model 


In an attempt to exploit the potential advan- 
tages of this block-driven principle, a novel 
hourglass computing model was developed (see 
Figure 4). The hourglass is loaded with a pair of 
“double mirrored" trees and has the computing 
power elements at one end, the instruction memory 
cells at the other, and hierarchical tree- 
structured data paths in between. Based on the 
block-driven principle, a block of fired func- 
tional cells pass through a pipe of instruction 
buffer units to the block control master, where 
the functional cells are distributed, and the 
separated computational branches are executed 
jocally on an interleaved basis in a pair of 
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processing elements. Data flow within the hour- 
glass is guided, primarily depending on whether 
data are global or local in type, either into 
transmission paths or reflection paths. The length 
of the transmission paths is fixed; therefore, 
there is no preference for all global data trans- 
fers. The reflection paths are varied, ranging 
from the shortest paths, which are localized in 
the processing pair, to the logarithmic paths 
within the buffering tree. This hourglass model 
has highly concurrent activities at both ends, 
while global data, which link the functional | 
cells, are trickling in between. 


Block-Driven Computer Architecture 


Various tree-structured computing machines 
have been proposed [11,12]. The tree machines 
have very high levels of concurrency and are well 
suited for implementation with current VLSI 
technology [11,12,13]. Based on our hourglass 
model, we propose a data-flow tree machine. It 
contains two kinds of trees: one is called the 
buffering tree and the other is the distributing 
tree (see Figure 5). In what follows, we will 
briefly describe the structures of these two 
trees and the associated functional units. 


Buffering Tree--The buffering tree is an n- 
level binary network capable of computing 2” FLP 
operations concurrently and transferring the 
results efficiently. It consists of 2"-1 nodes 
and one root node. Each one of these nodes is a 
controlling buffer, at which each visiting data 
item will either be buffered down or be trans- 
ferred out. This choice is based on a number of 
conditions which will be discussed later. Also 
associated with each leaf node is a pair of local 
supervisor and processing element where FLP 
operations are performed. The buffering tree has 
two primary roles: first, it is an interconnec- 
tion network to the 2 processing elements; 
second, it is a buffering channel between the pro- 
cessing elements and the instruction memory cells. 


Data movements within the buffering tree are | 
based on a two-phase push-pull mechanism. While 
capable of being pushed forward and pulled back- 
ward, local data can be precisely moved from one 
leaf node to another in a few number of push-pull 
cycles. Global data, which use the buffering tree 
as the channel, can be-pushed forward through the 
network and off via the root node in n push 
cycles. | 


Distributing Tree--The distributing tree is 
an m-level pipelined binary switching network. 
It provides the basic mechanism of routing streams 
of globally independent data to a set of function- 
al cells. There are 2. functional cells and data 
operands tied to the lowest level leaf nodes of 
the tree. Through the use of an m-bit address 
header, data items can be routed to their desti- 
nation cells. As data items enter the root node, 
they are pipelined through this m-level distrib- 
uting tree. 


As a data item is passed from one node to 
the lower level node, the select bit in the 
address header is deleted. As a result, the m- 
bit header is eliminated from a data item when 
the routings are concluded. 


Block Control Master--The major role of 
the block control master is to provide the tree 
machine with concurrent joint computations. 

The master communicates with all the local leaf 
supervisors. The master acknowledges when the 
tree machine is released from one block of 
computations. Then a following computational 
block will be sequenced and be distributed over 
the local leaf supervisors, and a joint 
computation will be executed. 


Instruction Buffer--The instruction buffer 
unit is used for the queueing of blocks of 
functional cells which are fired and ready to be 
executed and is built with intelligent FIFO buffer 
memories. 


Push-Pull Data Transfer Mechanism 


Because there are two types of data to be 
transferred in the same network, each node of the 
buffering network--a controlling buffer--is de- 
signed to work on a two-phase basis. In the push 
mode, data are pushed forward from the lower- 
level nodes to the higher-level nodes; whereas, 
in the pull mode, data are pulled backward in an 
opposite way. Each data item is tagged with a 
destination address field and a one-bit data type 
header. The width of the address is determined by 
the tree height--the higher the tree height, the 
wider the field. Specifically, a locally depen- 
dent data item has a relative displacement ad- 
dress and a one-bit direction header, while a 
globally independent data item has an absolute 
address. The relative displacement address is 
determined by the distance in which the two commu- 
nicating leaf nodes are apart and by the relative 
position in which the two nodes are located (see 
Figure 6). Data to be pushed forward or pulled 
backward depend on a one-bit mode control by 
ORing the one-bit data type header and selected 
bits in the address field. With a tree height 
of n, this mode control at the ith level, 

M., is given by 
1 
n-1 
i as O0<i<n-l, 


where the notation U stands for the logic OR 
operation. If the mode control is "1", data will 
be pushed forward; otherwise, they will be buffer- 
ed at the node at which they last visited and be 
ready to be pulled backward. 


Global data, which carry a "1" in the data 
type header a,, will allow themselves to be pushed 
forward through the buffering network. However, 
local data, which carry a "0" in the data type 
header, can never be pushed forward beyond the 
root level, because the mode control at the root 
level is always "0" for these data. Data which 
have already been buffered down to a node at some 
level will be pulled backward by one of the two 
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son nodes. The decision is determined by a left- 
right control--LR,-~at that level, with 


1<i< n-l, 


+aa,, 
ni 


a. 

ef pal 
where a is the direction header which determines 
whether the data to be directed to their right or 
to their left. If the left-right control is "1", 
the data will be pulled backward by the right son 
node, and if it is "0", they will be pulled back- 
ward by the left son node. A three-level buffer- 
ing tree is illustrated in Figure 7. 


Discussion 


The block-driven data-flow processor describ- 
ed executes clusters of data-flow instructions in 
a block of locally tree-structured processing 
elements. Two classes of data communication paths 
exist. The first communicates global data among 
blocks of the locally tree-structured processing 
elements. The second communicates local data 
among processing elements within a block. Maximum 
throughput occurs when the ratio of the local data 
communication rates within a block is much-much 
greater than the global data communication rates 
for the data channels bringing operands into or 
taking results from a block. So, the granularity 
of the tasks performed within the structured 
blocks determines the overall system performance 
for a fixed number of blocks and processing ele- 
ments. We are currently assessing the complexity 
of this data-flow structure in the context of its 


application to both vector processing and discrete- 


time filtering. 
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Figure 2. The ring configuration: 
CM: computing module and 
LS: local supervisor. 
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Figure 3b. Steps in an interleaved 
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Summary 
The topics of data driven computer and pro- 
gram organization have attracted considerable 


attention in recent years. A number of 
architectures have been proposed and several pro- 
“totype machines are now either operational or are 
being built[{1-8]. In addition, various groups 
have developed languages based on principles com- 
patible with data driven execution[8-13]. In this 
paper we describe two approaches to the problem of 
associating operations in a program with the pro- 
cessors which will execute them. The goal is to 
reduce the time required for program execution ies 
making judicious processor allocations. 

In this paper we are concerned with a class 
of programs written in a textual-form single 
assignment language. The fundamental control 
structures assumed are: BEGIN-END, WHILE-DO, and 
IF-THEN-ELSE. We assume that the parallelism is 
implicit, rather than explicitly expressed by a 
COBEGIN-COEND type construct. (The assumptions 
outlined here are completely consistent with those 
inherent in most, if not all, of the single 
assignment languages that have been developed.) 
The data driven rule for execution implies that 
unless there is a direct data dependency of one 
Operation on another, two operations can be done 
in parallel (provided other architectural con- 
straints of the system are met). Thus an _ opera- 
tion can be performed at any time after all of its 
operands are available, independent of any ex- 
ternal timing constraints. 


An optimal assignment of operators in a pro- 
gram to functional units is one which minimizes 
execution time. In a data flow system, there are 
two factors contributing to the overall 
time of a program: the computation time associated 
with each operation and the time required for 
transmission of results to the appropriate 
destinations for use as operands. It is not 
difficult to show that the problem of obtaining 
optimal allocations is NP-complete, hence the cost 
of doing so is prohibitive. 


For this. reason, the goals of our processor 
allocation schemes are threefold. First, an 
allocated program should take advantage of as much 
of the inherent concurrency as possible. Second- 
ly, since we believe that communication delays can 
have a significant effect on execution time, the 
time lost to interprocessor communication should 
be reduced wherever possible. Finally, the amount 
of analysis at runtime needed to perform any 
necessary dynamic allocation should be minimized. 
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plementation is 


Thus we attempt to allocate as completely as 


possible prior to runtime. 


The problem of allocating operations in a 
program to functional units in a data driven 
machine can be divided into two phases. The first 
of these involves reducing the magnitude of the 
problem by decomposing the program into smaller 
portions that are easier to analyze for alloca- 
tion. The second phase deals with actually per- 
forming the allocation. In the paragraphs which 
follow, we describe two sets of algorithms for 
solving the problem. Each has been implemented in 
PASCAL and data has been gathered with respect to 
its efficacy and complexity. 


Our first approach to the problem of resource 
allocation in data flow systems involves decompos- 
ing a program written in a block structured single 
assignment language into smaller segments accord- 
ing to the level of nesting associated with a 


block and analyzing each "local" subprogram to de-. 


termine the implicit parallelism and any inherent 
sequential restrictions. The statements in each 
program segment are then assigned to processors, 


and the results of the "local" allocation are pro- 
pagated for use in allocating operations in its 
containing block. 


As was previously mentioned, an algorithm for 
performing this analysis and allocation has been 
developed and implemented in PASCAL. Its im- 
a family of mutually recursive 
routines which accomplish the decomposition, pro- 
ducing lists of statements that can be done in 
parallel. These groups of "parallel" statements 
are then analyzed to determine what data de- 
pendencies exist between then. A processor 
allocation for each block is then produced and re- 
turned for use in obtaining a global allocation. 


allocation within each 
level of nesting is performed just after that 
segment has been analyzed. Two criteria are used 
in determining which processor should perform a 
If several statements are con- 
strained to be executed y; 

assigned to the same processor, that processor be- 
ing the one with the smallest number of statements 
already assigned to it. 


In this approach, 


A second approach to the problem of processor 
allocation has also been investigated. This 
approach also involves decomposing a program into 
smaller segments, but in this case the decomposi- 


tion takes a different form. Each block is first 
analyzed in order to determine the global 
sequential restrictions. In finding these re- 
strictions, a data dependency graph is constructed 
which exposes the dependencies that imply 
sequential restrictions. Given this graph, an 
allocation is obtained by traversing the graph and 
taking two factors into consideration at each 
stage: which operations have their operands avail- 
able and the amount of communication overhead that 
would be involved in allocating a statement to a 
given processor. Algorithms for allocation com- 
patible with this approach have also been designed 
and implemented in PASCAL. These algorithms are 
iterative in nature, whereas those mentioned pre- 
viously are highly recursive. 


It is not difficult to see that each of the 
two approaches outlined in this paper produces 
allocations that are, in general, not optimal. An 
analysis of the algorithms developed to implement 
the first approach mentioned reveals that a bound 


on the worst case time complexity is O(N?) where N 
is the number of statements in the program. It 


can also be shown that o(n’) behavior cannot be 
achieved, since not all of the pathologically 
difficult conditions can arise at once. In order 
to obtain the obvious improvement (over the 
optimal case) in the time required to produce an 
allocation, this approach sacrifices optimality. 
Not all of the inherent concurrency is exposed, so 
it is not possible to use all of this concurrency 
in determining the processor assignment. 


The second approach also provides a way of 
obtaining processor allocations quickly. The 
worst case complexity of this algorithm (as it has 


been implemented) is also bounded by o(N?), though 
we believe the average case will exhibit 
O(N log N) performance. It is evident that these 
algorithms, too, restrict the inherent parallelism 


in constructing the data dependency graph. This 
(potentially) adds some sequential restrictions, 
so that the allocations produced cannot, in 


general, be optimal. 

Each of these implemented algorithms has its 
advantages and disadvantages. On small programs, 
with varying numbers of processors in the 
simulated system, each appears to produce very 
good results. The process of gathering more 
Statistical data is, of course, an ongoing one. 
Research is also proceeding on the allocation pro- 
blem in the context of an expanded set of 
linguistic constructs. 
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DATAFLOW APPROACH TO DISCRETE SIMULATION 
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Summar y 


Discrete simulation is the technique of 


Simulating the dynamic behavior of a system at 


discrete points in time. Languages for discrete 


Simulation, such as GPSS, SIMULA, etc., are based 


on a sequential and centralized scheduler of 


events; however, greater concurrency can be 
achieved by a distributed approach, as proposed 
in [1], [2] and [4]. 


this idea in the context of a dataflow model [3] 


In this paper, we explore 
for two main reasons: 1) The interconnection and 
flow of entities in a simulation model, such as 


in GPSS, closely resemble the flow of streams 


between operators in a dataflow graph. 2) 
Dataflow models exploit both pipelined and 
horizontal concurrency; hence, unrelated or 


concurrent events can be executed asynchronously 


and concurrently 


In our dataflow simulation model, each data value 
is tagged with a positive integer, called a time 
tag. A stream of such tagged data values always 
has time tags in 


monotonically increasing or 


chronological order. Streams may be finite or 
infinite (As in most data-flow models, the size 
of an input stream to any operator is assumed to 
be unbounded). The time tag of a data value 


represents the local time of some temporary 


entity in the Simulation model. The processing 
of a data value by an operator is called an event 
and begins at the local time of the operator. 
The completion of an event can, but need not 
always, increase the local time of an operator. 
The concept of local time is introduced since 


there centralized 


is no global clock and “no 
scheduler of events in our dataflow simulation 


model. 


0190-3918/81/0000/0158$00.75 © 1981 IEEE 


We present some primitive operators for dataflow 
Simulation, and discuss the problem of correct 
simulation of events. The choice of operators is 
consider 
rather the 


properties of the entities. 


motivated by a need to the dynamic 


aspects of simulation, functional 


Since the dynamic 
behavior of our dataflow simulation model is 
determined mainly by the flow of streams, the 
operators of interest are various functions on 


Streams. Typical operators from this set are: 


If D is an untagged stream of 
nonnegative integers (d., d., ...) 
and S is a tagged stream (<v1,t4>, 
<Vo, bo>reee) then S!' is a tagged 


stream (<v,,t! >, <V5t! reese) 
where t's we dj + max(t'; _4,t;), and 
B S t', = 0. 
0 , 
Choice) B is a untagged stream of bit 
values, and 8S, S', S'' are tagged 
Ss oo streams. If the first bit value in 
B is 1 then the first data value in 
S is gated out to S' else the first 
data value in S is gated out to 
>" 5." Sth | 


S is the result of sorting S'! and 
S'' based upon their time tags. 


merge 


We assume two operators, source and sink, for 


creating and destroying tagged streams 
respectively. The delay operator models a 
Ssingle-server queueing system, in which the 


stream D represents the service ‘times for the 
objects in 8. The choice operator is used for 
splitting up a stream; its control stream B will 
usually be 


generated by some probability 


distribution. The merge operator produces an 
output stream that is in chronological order; 
hence, it needs a tagged data value on both its 
inputs before 


it produces an output. As as 


consequence, it may be assumed that the merge 
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Operator is determinate, i.e. will produce the 
same output stream given the same pair of input 


streams. 


There are basically two types of dataflow 
Simulation graphs: acyclic (figure 1) and cyclic 
(figures 2 and 3). Two types of cyclic graphs 
may also be distinguished: simple (figure 2) and 
shared (figure 3), depending on whether or not a 
merge operator is shared between two cycles. 
Cyclic graphs represent the notion of feedback in 
the simulation model and therefore are the more 
interesting case. Deadiock occurs in cyclic 
graphs due to a circular dependency between the 
output of a merge operator and its input. 
However, using information from other merge and 
delay operators in a cycle, it is possible to 
break deadlocks in a distributed fashion. In 
comparison, acyclic graphs do not require any 


additional mechanism for their correct operation. 


We now summarize a method for breaking deadlocks 
in simple cycles. Four phases may be identified 
TEST, START, 


During the TEST phase, a 


in the execution of a Simple cycle: 
EXECUTE, and RE-TEST. 
test message is sent around the cycle, by a 
pre-determined merge operator in the cycle, 
polling information from each delay and merge 
operator, in order to determine which merge 
In the START 
phase, the merge operator chosen to break the 


The EXECUTE 


operator must break the deadlock. 


deadlock is sent a start message. 
phase represents operation of the cycle after the 
deadlock has been broken. Special exit messages 


are sent around the cycle, by each choice 
operator in the cycle, to determine when the last 
data value leaves the cycle, i.e. to detect the 
The RE-~TEST phase begins 


when some choice operator detects the recurrence 


recurrence of deadlock. 


of deadlock and sends the pre-determined merge of 
the cycle a re-test message. The receipt of this 
message re-initiates the TEST phase all over 


again. 


Assuming M, oe M. are n merge operators in the 


cycle, dj 


operators eae Me and Mid? loc, 


is the composite delay of all delay 


is the local 
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time of this composite delay operator, and t. is 
the time tag of the first data value of the input 
are of Mi that is not in the cycle, then the 


merge operator M, chosen to break the deadlock is 


such that Sion 


San Sar Shan? where 


S.) = if k < j then t, else 
J» 
dy + max(1oé,, Sj 4k-1) 


has the minimum value over all 


The deadlock is broken by sending the data value 
ty along the output of M;. 


An extension of the above method can be used for 


breaking deadlocks in shared cycles, but is 


omitted here due to shortage of space. Work is 


in progress in formalizing the algorithm for this 


extension. 
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Summary 


Data flow architectures bring a great contri- 
bution about the parallelism exploitation because 
they are able to detect, at execution time, ins-. 
tructions which are executable concurrently. Howe- 
wer, most of the time, parallelism is exploited at 
program instruction level [1-3]. That implies a 
considerable flow of communication in the system 
because instructions, and sometimes their operands, 
are communicated separatly. 


To minimize the flow of information, it may 
seem useful to exploit the parallelism at the le- 
vel of a set of instructions, by grouping the ob- 
jects used by these instructions and these instruc 
tions into an indivisible set that is called a 
"block", The communication between such sets is re- 
duced to input and output parameters. So, a block 
forms a module which contains all the information 
which are necessary for its execution. 

In this paper is described maup'®) , a system 
based on the subdivision of programs into blocks 
and a data driven execution [4]. A program for 
MAUD is a set of blocks. Blocks are composed of a 
set of instructions and of the objects they use. 

A block can be viewed as a generalized primitive 
applied to input objects (I-objfects) which are 
calculated by other blocks, and providing output 
objects (0-obfects) which will be usedas I-objects 
by other blocks. Blocks are built before the pro- 
cessing by the system. 


Communication between blocks is conducted so- 
lely by means of I-objects and O0-objects. A com- 
munication name is associated to I-objects and 
O-objects : it does not point out an explicit me- 
mory cell ; it is used only to point out values. 


The single assignement rule is applied to the 
communication names at the block level. It allows 
the natural expression of dependencies existing 
between the blocks of a program, and therefore the 
expression of parallelism at execution time. More- 
over, it allows the use of a data flow control for 
the execution of a program : a block is ready for 
execution when all its I-objfects, which are 0-ob- 
jects of other blocks, have been propagated to 
this block. Such a block is an executable bkock. 
If one or several of its I-obfects have not been 


(a) : MAUD : Machine d'Assignation Unique Dynamique. 
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assigned values, the block is called a waiting 
block. Executable blocks may be executed concur- 
rently. 


Two special operations used during block exe- 
cution have been defined : (i) an execbLock ope- | 
ration, which allows the execution of a block, a 
model of this block existing in a library. It has 


‘certain similarities to a procedure call (it is 


an execution request for a block which must exist 
in the library) and to a FORK operation (the exe- 
cution of the requested block may run concurrently 
with the execution of the block which made the re- 


quest) ; (ii) a watt operation which allows the 


calling block to wait 0-objects calculated by the 
execbkLock operations. The execution of the block 
which performs a watt operation is then suspended, 
and this block is transformed into a waiting 
block, waiting for the objects which appeared in 
the watt operation. Thus, the execution processor 
becomes free. | | 


The use. of execblock and watt operations gi- 
ves a dynamic characteristic to the execution of 
a program because of the addition of a number of — 
blocks which are executable concurrently during 
the program execution. An example of the utiliza- 
tion of these operations can be found in [5]. 


The multiprocessor is composed of : 

- a Set of execution processors P. whose 
function is the execution of a block seach pro- 
cessor has a local memory and is able to execute 
a block in an autonomous way. The blocks are not 
dedicated to the processors. As soon as a proces- 
sor is idle because it has just executed a wait 
operation or because the block execution is over 
it searches a new executable block, if there is 
one left. 

- an updating processor, UPD, which updates 
the waiting blocks with O-objects produced by the 
execution processors at the end of block execu- 
tion, and which finds out the waiting blocks with 
all the I-objects assigned in order to transform 
them into executable blocks. 

- a builder processor, BUILDER, whose func- 
tion is to build a block using the execution ne- 
quests produced by the execution processors when 
they perform execbkock operations. It manages the 
library of blocks. 


There is no direct communication path between 
the processors. The waiting blocks and the O-ob- 
jects are sent to the UPD processor, and the exe- 
cution nequests to the BUILDER processor. But they 


are sent through shared memories 

- an A-memory, which holds the wacting 
blocks ; 

- an S-memory, which holds the 0-objects pro- 
duced by the execution processors at the end of 
the block execution 

- a D-memonry, which holds the execution re- 
quests produced by the execution processors when 
they perform an execblock operation. 

- an X-memory holds the executable blocks. 


A functional description of MAUD is shown in 
Fig. 1. An example of the execution of a program 
in MAUD can be found in [6]. 


An implementation of MAUD has been studied. 
The processors are realized with conventional mi- 
croprocessors, except the UPD processor which is a 
very specialized one, because it must be very fast 
while not necessarily very powerful, and it must 
be able to have associative accesses to A-memory 
and S-memory. 


Memories have two functions storage of the 
various objects of MAUD, and communication. For 
the realization, it has been chosen to use a uni- 
que memory shared by all the processors : it is a 
ring of circulating memory. That allows simulta- 
neous accesses for reading and writing by all the 
processors, and it is easy to have associative ac- 
cess for the UPD processor. The ring is divided 


into logical sectors of the same size called slots. 


Every slot can hold any kind of objects, but only 
one at a time ; execution requests and 0-objsects 
circulate temporarily (no more than one lap) ; 
waiting bkocks and executable bkocks are kept cir- 
culating in the ring until the former become exe- 
cutable blocks and the latter are picked out from 
the ring by a free processor. A MANAGER processor 


is necessary to regulate the load of the ring, i.e. 


to put some blocks temporarily out of the ring 
when the number of empty slots gets too small, and 
to put them back inside when the number of empty 
Slots is increasing. In fact, no new processor is 
needed, this function may be realized by the UPD 
processor or the BUILDER processor. 


In order to justify this choice, a simulation 
of the system with the above characteristics of 
hardware implementation has been done. The obtai- 
ned results for a simplified version of the system 
makes it clear that the gain in the processing 
speed is important compared to a conventional mo- 
noprocessor system, if the block execution time is 
not too short compared to the duration of a com- 
plete lap of the ring. Yet, this condition is not 
necessary because it is possible to have dynamic 
reconfigurations of the ring allowing the reduc- 
tion of the access time [7]. 
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HIGH LEVEL SPECIFICATION OF RESOURCE SHARING 
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Summary 
A high level specification language is des- result in a state described by the resource con- 
cribed making it. possible to very concisely specify straints. State variables are defined local to the 
the orderly sharing of a protected resource [1]. resource scheduler and reflect the actual state of 
The rules and policies dictating resource usage the resource. 
are specified separately and clearly making it The ordering policies specify the usual 
easy to write, understand, and change them. Since policy used to decide what request should receive 
the specifications themselves are enforced; no service next. Among the requests acceptable to 
errors in resource sharing are introduced in pro- the resource, the ordering policy determines which 
gramming the enforcement of them. actually begins service. In case of ties, the 
| | older request is granted service. The ordering 
Many schemes have been proposed and developed policy is specified in terms of priorities between 
to aid in resource sharing. Hoare's monitors and request types, ordering based upon some value in 
Hewitt's serializers were designed primarily to the request, or some other standard ordering 
enforce cooperation among users sharing resources. scheme (e.g., elevator algorithm). 
These schemes provide primitives and language 
structures which make it relatively easy to The main purpose of the ordering policies 
write code to enforce the necessary rules and is to achieve efficiency in resource use or effi- 
desired policies upon resource sharing. ciency in the processes which use the resource. 
Efficiency considerations alone, however, can 
This work describes how to directly specify lead to very poor service or no service for some 
the resource sharing rules needed and policies requests. The postponement and expedite policies 
wanted. The code to enforce these rules and are used to modify the ordering policies to avoid 
policies can then be automatically generated from extremely poor service. 
the high level specification provided. The advan- 
tages are clear. Since the rules and policies are The postponement policies specify under what 
specified directly, it is known exactly what they conditions newly arrived requests are not to be 
are and that they are enforced. considered for selection even if they would be 
acceptable to the resource and no other waiting 
Ramamritham and Keller [2] concurrently with requests are acceptable. If, however, there are 
and independent of this work attacked the same no waiting requests then postponed requests may 
problem. Their specification language is at a be selected for service. A request can only be 
different level. State variables are conceptually postponed when it initially arrives and then only 
different and the implementation schemes are until the postponed condition for it becomes 
entirely different for the two systems. false. The postponement conditions may involve 
the current resource state and a consideration of 
Request messages for a protected resource are other waiting requests. 
sent to its scheduling module. This module uses | 
the high level specifications provided to determine The expedite policies specify under what 
what requests and when requests are sent to for conditions the ordering policies are to be vio- 
service. These specifications are: lated and a non-postponed request is selected to 
‘description of the requests, be the next request in line for service. No 
*resource constraints, other requests are allowed ahead of a request 
‘ordering policy, chosen by expedite. 
*postponement policy, and 
‘expedite policy. The postponement policy should be used to 
hold back requests which might otherwise cause 
The description of the requests defines the starvation of any of several requests waiting for 
fields in request messages that will be used by service. The expedite policy should be used to 
the resource scheduler to aid in scheduling them. identify a request being starved and make it 
The resource modules that provide the service for next for service. 
each type of request are also specified as well as ; 
the updates which the performance of these requests Figure 1 illustrates the scheduling strategy. 
cause to the state variables. A process requests service by sending a request 
message to the Scheduling Module. The Scheduling 
The resource constraints consist of defining Module implements the resource sharing specific- 
those states in which the resource continues to tions, forwarding the request to the Protected 
correctly service requests. These states are des- | Resource Module when it is to be performed. When 
cribed in terms of the values of state variables service is complete, the process receives a re- 
and the requests that can simultaneously be ser- Sponse message. Requests can only reside as post- 
viced by the resource. A request is acceptable to poned requests, ordered requests, expedited 
the resource if its inclusion for service would requests, or requests being serviced. The un- 
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new request arrivals 


SCHEDULING MODULE 


postpone conditions 


POSTPONED 
REQUESTS 


ORDERED 
REQUESTS 


postponed 


conditions 
true 


expedite conditions 


EXPEDITED 
REQUESTS 


resource constraints 


ordering 
resource constraints 
no expedited requests 


no ordered requests 


resource constraints 


increase uncertainty 
in state variables 


REQUESTS 
BEING 
SERVICED 


in state variables 


response to 
MODULE 


Figure 1: Overall Scheduling Strategy 


certainty in state variable values increases when 
a request begins service and decreases when ser- 


vice is complete. This handling of state variables 


is faithful to what the scheduling module can know 
of actions of the resource module allowing for 
natural specifications of resource constraints. 
Figure 2 is a specification of the classical 
producer/consumer problem. The resource can hold 
up to 10 items. An insert request message to the 
insert routine adds another item into the re- 
source and a remove request message removes an 
item from it. At most one insert request and one 
remove request can be serviced at the same time. 
The maximum number of items that can be placed in 
the resource is 10 and the minimum number is 0. 
To use the constraints, the preconditions for each 
type of request are derived. For the case of an 
insert request, the preconditions are less than 10 
items already saved and no insert request receiv- 
ing service. The number of items is kept track of 


DECLARE STATE VARIABLES #items INITIALLY 0 


REQUEST DECLARATIONS 
REQUEST FIELDS type CHARACTER(1) 
item CHARACTER(99) 


insert HAS type = 'I' 
remove HAS type = 'R' 


PROCESSING 
insert PROCESSED BY insert-routine 
UPON SERVICE #items := #items + 1 


remove PROCESSED BY remove-routine 
UPON SERVICE #items := #items - 1 


RESOURCE CONSTRAINTS 
insert.ACTIVE < 1 AND remove.ACTIVE < 1 
AND 0 < #items AND #items < 10 


Figure 2. A Producer/Consumer Problem 


no expedited requests 


decrease uncertaint 


completed requests 


in the scheduling module through the use of the 
local state variable #items. The PROCESSING 
clause indicates that during service of an insert 
request, the number of items is increased by 1 

and during service of a remove request, the number 
of items decreases by 1. There is, however, un- 
certainty as to exactly when these changes occur. 
#items is kept as a range of possible values. For 
example, if there were 9 items and both a remove 
and insert request were receiving service, #items 
is the range [8,10]. If the remove request com- 
pletes first the range becomes [8,9] and when the 
insert request subsequently completes the range 
becomes [9,9]. 


Figure 3 is a high level specification for 
sharing a moving head disk. ORDERING specifies 
both a primary and secondary ordering policy. If 
there is more than one request for a disk address, 
then the write requests are done before the read 
requests for that address. The elevator algorithm 
insures that no addresses (at the ends) are ig- 
nored. The only way a request can wait forever is 
if new requests for the same address keep receiv- 
ing service. The POSTPONE prevents this by not 
allowing these newly arrived requests to be con- 
sidered for service until the disk has moved off 
the address they are requesting 
(THISREQUEST.addr = ACTIVE.addr). 


REQUEST DECLARATIONS 


REQUEST FIELDS type CHARACTER(1) 
addr CHARACTER (6) 
data CHARACTER(505) 


read HAS type = 'R' 
write HAS type = ‘'W' 


PROCESSING 
PROCESSED BY disk-driver-routine 


RESOURCE CONSTRAINTS 
read.ACTIVE + write.ACTIVE Sik 


ORDERING PRIMARY BY ELEVATOR ON addr 
SECONDARY write BEFORE read 


POSTPONE read IF THISREQUEST.addr 
write IF THISREQUEST.addr 


ACTIVE .addr 
ACTIVE.addr 


Figure 3: Moving-Head Disk Scheduler 
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Summary 


This paper introduces a technique for the 
virtual elimination of conditional branch instruc- 
tions during program execution. The technique, 
called Multiple Path Exploration (MPE), aims 
at increasing the potential concurrency between 
program instructions by, automatically and dynami- 
cally, removing procedural dependencies. 


There are basically two types of dependencies 
between instructions: Data Dependency, when one. 
instruction requires data from a previous instruc- 
tion; and Procedural Dependency, due to the 
specification of the instructions sequence. There 
is a procedural dependency between a branch 
instruction and the instructions following it in 
sequence. Conditional branch instructions cause a 
wait until the condition is resolved before the 
next instruction in the sequence is determined, 
thus imposing severe limitations on the attempts 
to detect and exploit concurrency. 


In order to eliminate the procedural dependency 
caused by the presence of a conditional branch 
instruction in a program, the execution must pro- 
ceed simultaneously down the two possible paths 
emanating from the branch. To bypass x condition- 
al branch instructions, as many as 2* paths must 
be processed simultaneously. 


Instead of bypassing all conditional branch in- 
structions of a program simultaneously, only a 
subset consisting of a fixed number, m, of bran- 
ches may be bypassed at any given time. Out of 
2 paths, only one path will remain valid, while 
all the others may be discarded. Another set of 
(2™) paths, generated from the valid path are 
explored next. This process continues until the 
program is completely executed. | 


Branch instructions are grouped into sets. 
Each set represents a Branch Level (Fig. 1). Each 
path is uniquely identified by a Path Code. A 
Branch Code identifies each instruction with at 
least one path. The concepts of Branch Level, 
Path Code, and Branch Code provide tools to 
automate the process of generating and discarding 
of branch paths dynamically during program execu- 
tion. 


Further performance improvement can be achieved 
if the instructions of each path are not executed 
in a strictly sequential order. This becomes pos- 
sible if there is a mechanism associated with 
every path, which detects data independent instruc- 
tions. The Ordering Matrix technique is suitable 
for the detection of data independent instruc- 
tions, especially in the absence of branch in- 
structions [1], [2]. The Ordering Matrix (M) for 
a sequence of N instructions is an N x N Boolean 
matrix such that: 
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1 Iff instructions I. & I. 
M. are dependent. ] 


0 Otherwise 


The presence of branch instructions in gener- 


al, and backward branches in particular, have 


complicated algorithms to detect data independ- 
ence and limited the amount of concurrency 
detected [2]. Their absence within each path 
enables the detection of more concurrency using 
less complex algorithms [1]. 


Foster and Rieman [3] found that a speed-up 
in program execution by a factor of 51 may 
theoretically be achieved if all conditional 
branches are bypassed. Using the MPE technique, 
the speed-up factor is expected to be as shown in 
Fig. 2 as a function of the number of branch 
levels m ( and consequentially the number of 
streams N) [1]. A speed-up factor of more than 5 
may be achieved for the case of m = 4, where 16 
paths are processed simultaneously by different 
instruction streams. 


The MPE technique may be implemented using 
a Multiple Instruction Stream, Multiple Data 
Stream Organization as shown in Fig. 3. The 
private data memory is used to enable the dis- 
carding of invalid paths. An execution speed of 
15 MIPS may be obtainable. fThe architecture of 
Fig. 3 is discussed in detail in reference [1]. 


The proliferation of VLSI and microcomputer 
technology is expected to make the implementation 
of such a highly parallel system organization 
cost-effective in the future. 
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EXPERIMENT IN PARALLEL PROCESSING | 
A LARGE SCIENTIFIC CODE 


Ingrid Y. -Bucher, Bill L. Buzbee, and 
Paul O. Frederickson 


Computer Research and Applications Group 
Los Alamos National Laboratory 
Los Alamos, New Mexico 87545 


Summary 


We wish to report results of a successful 
initial experiment in our study of the 
usefulness of multiple processor architectures 
for large scientific computations. 


It is believed that a hundredfold increase in 
computational speed will be required over the 
next decade inorder to meet a variety of 
scientific needs’. The prospects of speeding 
up a Single processor mainframe by a factor 
greater than ten beyond the fastest machines 
available today, seem rather dim. It follows 
that parallel processing is necessary in order 
to meet the needs of the scientific community. 

Using current technology, interprocessor 
communication, via either direct communication 
lines or common memory, is a_ significant 
factor to be considered in the design of an 
algorithm to fit on a parallel architecture. 
Our guiding philosophy is to divide current 
computational problems into relatively large 
tasks with a high degree of independence, thus 


minimizing the need for interprocessor 
communication. To avoid common memory 
contention we are looking at parallel 


architectures in which each processor is 
equipped with a reasonable amount of private 
memory. In that framework we wish to consider 
the usefulness of a variety of interconnection 
schemes. 


Our initial experiment involved formulating a 
particle-in-cell simulation of a plasma? for 
a Simple star graph architecture with a UNIVAC 
1110 at the hub Py, of the star and up to 
eight Floating Point System 120B-~ array 
processors at the other vertices Pj. Each 
of the nodes P; was equipped with at least 
48k words of memory, but there was no fast 
access common memory available in this system. 


Figure la illustrates the main computations 
within one time-step of our model algorithm as 
carried out on a monoprocessor. Our 
adaptation of this algorithm to fit on the 
architecture described above is 
Figure lb. As indicated in the figure, the 
potential $6 is computed for n cells from the 
charge distribution C by processor P,. on 
values of > are subsequently transmitted to 
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processors Pj, i=l,...,8. The computation 
of the field E from the potential ¢ is carried 


out by each of the processors Pj, in 
parallel, in order to reduce the data transfer 
from 2n to n items (for a 2-dimensional 
problem). Each processor Pj; then moves its 


share of mj particles through the electro- 
Magnetic field, a step which constitutes the 


P 


Compute Potential 
g = 9C) 


Compute Potential 
| ¢ = C) 


Push m particles. 
through field E, 
Discretize charge 
C at n points 


major contribution to the computational 
process, and computes their contribution Cj; 
to the total charge distribution C. As final 
steps, n values of Cj have to be transmitted 
from each processor Pj; to processor Py 
where they are summed to yield the charge 
distribution C for the next time-step. 


The multiprocessor used in our experiment is 
located at the Naval Ocean Systems Center in 
San Diego. In a fairly typical run, we moved 
m = 32,400 particles, distributed evenly over 
six array processors, in an n=18x1l8_ size 
grid. Each time step required 0.54 s. Of 
this 0.265 Ss was spent solving Poisson's 
equation in the host at a rate of 0.2 
MFLOPS/s. Each array processor P; pushed 
5400 particles in 0.13 s at a rate of 5.7 
MFLOPS/s>, The remaining time, 0.145 s, was 
spent in initiating data transfers to and from 
the array processors and transmitting the 
data, most of it being system overhead. 


It is fairly obvious that the time spent on 
solving Poisson's equation could have been 
reduced to less than 0.01 s by moving the 
process from the relatively slow host to the 
array processors. Our experience shows that 
to speed up interprocessor communications an 
operating system that allows for parallel data 
transfers with a minimum of system overhead, 
or efficient access to common memory, is 
highly desirable. 


We conclude that in spite of the limitations 
of the system used, significant speedups via 
parallel computation are achievable for 
particle-in-cell plasma simulation and related 
problems. 


We thank the staff of the simulation facility 
of the Naval Ocean Systems Center for their 
assistance in using their system, particularly 
Bob Dukelow, John Mayr, and Ron Dahlseid. 
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ITERATORS AND CONCURRENCY 


A. T. Berztiss 
Department of Computer Science 
University of Pittsburgh 
Pittsburgh, PA 15260 


We propose a modest parallel execution facility 
for essentially sequential programs, particularly 
appropriate for very small computers. It consists 
of three approaches. First, iterators are re- 
garded as a major programming tool. We discuss 
parallel composition of iterators, and use of 
iterators as a means for buffered access to ele- 
ments of composite data structures. Second, 
Dijkstra's guarded command set is given a new 
interpretation: all actions for which guards are 
true may be performed in parallel. Third, we con- 
sider the partitioning of iteration sequences into 
segments that may then be executed in parallel. 


Empirical studies of computer programs have 
shown that most processing time is spent in exe- 
cuting loops, and that most loops are concerned 
with providing orderly access to elements of com- 
posite data structures [1] - [3]. Our aim here is 
present mechanisms for achieving limited parallel- 
ism in the execution of for-loops, and we do so in 
the context of data abstraction. 


We shall use iterators coupled to for-loops. 
Iterators are provided in CLU [4] and Alphard [5], 
and their purpose is to deliver the elements of a 
composite data structure in the particular sequ- 
ence determined by the specification of the itera- 
tor. This achieves separation of the traversal of 
a data structure from the computations carried out 
on the objects delivered by the iterator in the — 
course of the traversal. Iterators can therefore 
be made part of an abstract data type. We have 
introduced controlled iteration [6]. This genera- 
lization in the usage of iterators enables the 
same for loop to invoke more than one iterator, 
and the iterators coupled to a given for loop to 
be synchronized. The form of a for loop with con- 
trolled iteration is | 


for J1,J2,...; 
loop body; 
end loop; 
loop tail; 
end for; 


Jn loop 


This construct is described in [6], and familiari- 
ty with this reference is being assumed. 


Because iterators can appear only in a specific 
context, namely the for loop, a single stack suf- 
fices for runtime control (as long as the itera- 
tors are nonrecursive and do not invoke other 
iterators): On reaching a for statement, create a 
vector of n activation records, one for each of 
the Jl, J2, ..., Jn, where it is assumed that an 
activation record contains space for all temporary 
locations needed by the corresponding iterator. 
The vector of activation records is treated as a 
single unit until it is discarded as a single 
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unit on reaching the end for. Iterators are a 
special type of coroutines, but they have advan- 
tages over general coroutines. This has been dis- 
cussed in [6]. Here we note that it is the con- 
fining of iterators to for loops that permits look 
ahead and buffering in the implementation of an 
iterator. 


As regards difficulty of program proofs, our 
generalized for loop occupies an intermediate 
position between a conventional for loop and a 
while loop. Because one can interpret iterators 
as coroutines, our generalized for loop can be 
translated in a mechanical manner to a while loop 
that contains coroutine calls. At the very worst, 
then, the proof rules of the while loop can be put 
to use. Most instances, however, advantage can be 
taken of the fact that the generalized for loop is 
rather closer to the conventional for loop than 
to the while loop. 


_ The use of controlled iteration implies that 
some decisions are made within the loop body. 
Consequently the entire loop body may be an if 
statement. We now introduce a change of notation 
for the if statement: 


if Bl then S51; 
elsif B2 then S82; 


elsif Bn then Sn; 


end if; 
becomes 
cond Bl: sl Q 
B2: s2 J 
Bn: Sn 
end: cond; 


The latter has been made to resemble Dijkstra's 
guarded command set [7]. It then suggests new 
interpretations. When the conditional is inter- 
preted as equivalent to the if statement we call 
it a sequential conditional. Dijkstra's inter- 
pretation is that any Si that corresponds to a 
true Bi may be executed, but this interpretation 
can be carried further: All Si for which guards 
Bi are true may be executed concurrently. The 
conditional then becomes a concurrent conditional 
with the following informal semantics: 


All Bi 
any Si 


b. All Si following true Bi are executed fully, 
or, if any such Si contains an exit, up to 
the exit, and if an exit has been encoun- 
tered, exit from the loop body takes place 
after all this has been done. 


are evaluated before the execution of 
begins. 


ae 


Example: Determine whether or not a given key 
is present in a linear list that is being simul- 
taneously traversed from both ends. Here we can 
have concurrent advances in the list, and con- 
current evaluation of the Boolean expressions 
(guards). 


for X in FILE.UP 
Y in FILE.DOWN loop 


cond 
X.KEY = Y.KEY : exit D 
X.KEY = GIVEN KEY: exit Q 
Y.KEY = GIVEN KEY: exit 
end cond; 
end loop; 


if X.KEY = GIVEN_KEY or Y.KEY = GIVEN _KEY then 
PUT("Given key is in the list"); 
else 
PUT("Given key not found in the list"); 
end if; 
end for; 


We now consider partitioned iteration sequences 
in the context of matrix multiplication: Matrix C 
is to receive the product of matrices A and B, and 
it is assumed that C has already been initiated to 
zeros. In conventional Ada syntax this can be 
written as follows: 


for I in A'FIRST..A'LAST loop 
for K in A'FIRST(2)..A'LAST(2) loop 
for J in B'FIRST(2)..B'LAST(2) loop 
C(I,J):= C(1I,J) + ACI,K)*B(K, J); 
end loop; 
end loop; 
end loop; 

Here matrix C is built up one row at a time. In 
building up a row in C, the corresponding row in A 
is traversed just once, but B is traversed in its 
entirety. What matters is that in generating a 
particular row of C only the one corresponding row 
of A is needed, i.e., the trdversal of A can be 
partitioned into independent traversals of its 
rows. Consequently we now consider matrix A as a 
set of vectors (rows of the matrix). The matrix. 
multiplication code is reformulated to make use of 
the separation of the matrix into a set of vectors 
to induce parallelism. It is the declaration of 
the data structure as a set before the for loop is 
entered that enables the system to recognize the 
opportunity for concurrency. The loop itself con- 
tains no indication to this effect. 


for X in ROWSETA.TRA loop 
I:= C'FIRST(2) ; 
for controlled A in X.FORWARD, 
B in MATB.ROWWISE(ENDROW) loop 
C(X.ROWNO,1I):= C(X.ROWNO,L) + A.VAL*B.VAL; 


I := I+l1; 
if ENDROW then 
promote A; 
I:= C'FIRST(2); 
end if; 


end loop; end for; 
end loop; end for; 


Here we have three iterators: (i) Iterator TRA 
delivers a complete row of the matrix. It is 
understood that the object denoted by ROWSETA 
functions as a set of vectors. The guise of this 
matrix as a set permits parallelism. The body of 
the outer loop can be executed concurrently by as 
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many processors as there are rows in the matrix. 
An attribute of the row delivered by TRA is the 
index of this row in reference to the matrix as a 
two-dimensional array (ROWNO). It establishes 
correspondence between the rows of the input 
matrix and of C. (ii) FORWARD delivers the ele- 
ments of the row supplied by TRA. (iii) ROWWISE 
is associated with the second input matrix in its 
guise as a proper matrix (MATB). It delivers ele- 
ments of MATB one by one in roworder. Parameter 
ENDROW associated with ROWWISE is normally false, 
but it becomes true for any pass through the loop 
in which an element that terminates a row in MATB 
is being accessed. 


One problem that remains is the synchronization 
of components of several partitioned iteration 
sequences. Such is the case when matrices A and C 
are both regarded as sets of row vectors. Then it 
has to be ensured that the row in C generated 
using a particular row in A properly corresponds 
to this row in A (for example, that the row gene- 
rated using the second row of A becomes in fact 
the second row of C). Iterators are used to en- 
force the required correspondence. If the same 
iterator ranges over two sets, then an ordering of 
the elements of the sets into sequences has to be 
assumed, and corresponding elements from the 
sequences are assigned to the same instance of 
execution of the loop body. 


Note that our purpose has not been to solve 
general concurrency and synchronization problems. 
Nevertheless, as regards the execution of such 
programs as are currently executed on very small 
computers, use of our mechanisms can lead to 
substantial reduction in execution time by en- 
abling the computational load to be spread over 
several processors. 


References 


[1] D.E. Knuth, "An Empirical Study of Fortran 


Programs", Software--Practice and Experience 
1 (1971), pp. 105-133. 


[2] T.W. Pratt, "Control Computations and the 
Design of Loop Control Structures", IEEE 


Trans. Software Eng. SE-4 (1978), pp. 81-89. 


[3] D. Grune, "Statistics of Algol 68 Programs", 
SIGPLAN Notices (ACM) 14, 7 (July 1979), 
pp. 38-46. 


[4] B. Liskov, A. Snyder, R. Atkinson, and 
C. Schaffert, "Abstraction Mechanisms in CLU", 
Comm. ACM 20, 8 (Aug. 1977), pp. 564-576. 


[5] M. Shaw, W. Wulf, and R. London, "Abstraction 
and Verification in Alphard: Defining and 
Specifying Iteration and Generators", Comm. 
ACM 20, 8 (Aug. 1977), pp. 553-563. 


[6] A.T. Berztiss, "Data Abstraction, Controlled 
Iteration, and Communicating Processes," Proc. 
ACM Annual Conf., Nashville, TN, 1980, pp. 
197-203. . 


[7] E.W. Dijkstra, "Guarded Commands, Nondetermi- 
nacy, and Formal Derivation of Programs", 
Comm. ACM 18, 8 (Aug. 1975), pp. 453-457. 


OPTIMAL PARALLEL ALGORITHMS FOR THE CONNECTED. COMPONENT PROBLEM) 


Francis Y. Chin 


John Lam 


I-Ngo Chen 


Department of Computing Science 
University of Alberta 
Edmonton, Alberta T6G 2H1 


Abstract -- In this paper, we study a parallel 
algorithm for computing the connected components 
of an undirected graph, using the Single Instruc- 
tion Stream-Multiple Data Stream model. We as- 
sume that the processors have access to a common 
memory and that no memory or data alignment time 
penalties are incurred. We derive a general time 
bound for a parallel algorithm which uses K pro- 
cessors for finding the connected components of 
an undirected graph. In particular, an 0(log4n) 
time bound can.be achieved using only | 
K = n In/log2nl processors. This result is op- 
timal in the sense that the speedup ratio is 
linear with the number of processors used. The 
algorithm can also be modified to solve a whole 
class of graph problems with the same time bound 
and fewer processors than previous parallel 
methods. 


I. INTRODUCTION 


The dramatic drop in the cost of computers en- 
courages the use of parallel computers. Parallel 
computers are capable of performing several in- 
dependent operations concurrently. In the fol- 
lowing discussion, we assume that (1) processors 
share the same memory; (2) each processor can 
perform any arithmetic, Boolean or logical opera- 
tions in one time unit, and all instructions ex- 
ecuted in parallel are identical (Single Instruc- 
tion Stream-Multiple Data Stream [7]); (3) sim- 
ultaneous read operations on the same location 
are allowed, but not simultaneous write opera- - 
tions; (4) no memory or data alignment time pen- 
alties [15] are incurred. 


Parallel algorithms for sorting and numerical 
applications have received substantial attention 
recently [1,9,11,17,23]. Much work has been done 
on the development of efficient parallel graph 
algorithms [5,6,8,10,12,14,18,19]. A parallel 
algorithm which uses n2 processors to find the 
connected components of an undirected graph with 
n vertices in O(log2n) time was proposed in [10]. 
Recently, it has been shown by Hirschberg et al. 
[12] that an O(log2n) time bound can also be 
achieved using only n [n/log n] processors. In 
this paper, we present a modified version of this 
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only K processors are available. 


can be described briefly as follows. 


algorithm which requires 0(n2/K + log2n) time if 
: In. particular, 
the 0(log2n time bound can be achieved with on- 
ly nf n/log*n] processors. This modified algor- 
ithm is optimal in the sense that the speedup 
ratio [21] is linear with the number of proces- 
sors used. Furthermore, we demonstrate that 
this algorithm can be used to solve a class of 
graph problems with the same time and processor 
bounds in a forthcoming paper [4]. 


Section 2 presents definitions used in this 
paper. Section 3 studies the modified algorithm 
and derives its time and processor bounds... Sec- 
tion 4 summarizes these results and discusses 
further research possibilities. 


IL, DEFINITIONS 


An undirected graph G = (V,E) consists of a 
finite, non-empty set V of n vertices and a set 
E of unordered pairs of vertices called edges. 
We represent G by its adjacency matrix A, which 
is an nxn symmetric Boolean matrix where 
A(i,j) = 1 if and only if (i,j) € E. G is con- 
nected if there exists a path between every pair 
of distinct vertices in V. A connected compo- 
nent of G is a maximal connected subgraph of G. 


If Tk is the time required by a parallel al- 
gorithm using K 2 |. processors, the speedup rat- 
io of the K-processor computation over the cor- 
responding uniprocessor computation (taking time 
Ti) is defined as Sy = Tj/Tx. 


Throughout the paper, log n denoted [loggn |] 


iti. CONNECTED COMPONENTS OF AN UNDIRECTED GRAPH 
Figure 1 shows the algorithm MOD.CONNECT for 
finding the connected components of an undirect- 
ed graph. The actions of algorithm MOD.CONNECT 
Each ver- 
tex belongs to exactly one connected component. 
Array D is used to specify the connected compo= 
nent for each vertex, thus D(i) = D(j) if and 
only if vertices i and j belong to the same 
component. Step | initializes the arrays D and 
Flag whose function will be discussed later. 
During the first iteration, step 2b selects the 
smallest numbered vertex among all the vertices 
incident upon vertex i and assigns it to C(i). 
Step 3 eliminates the isolated vertices. Steps 


Algorithm MOD. CONNECT 


Input: The nxn adjacency matrix A for an un- 
directed graph. 


Output: The vector D of length n such that D(i) 
equals the smallest-numbered vertex in 
the connected component to which i be- 
longs. 


Comment: Each of the following steps is executed 
in parallel for all i, 0 < i <n oor for 
all i¢€S. The assignments in the various 
steps are considered to be done simulta- 
neously for all i. 

The vector Flag of length n such that 
Flag(i) = 1 indicates vertex i is a 
current supervertex. Current supervert- 
ices are stored in set S. 


l for all i, 0 <i <n do comment: Initialization 

~ Di) <--i 

Flag(i)<--1] 

do step 2 through 8 for log n iterations 

comment: Uniform Smallest Incident Node 

Selection 
2a S<--{il Flag(i) = 1} comment: D(i) = i for ieS 
2b for all ieS do 

~ C(i) <-- MinfJ | ACi,j) = 1} 

jes 
if. none then i 

comment: Eliminate the isolated supervertices 
3 for all ieS do 

if C(i) = 7 then Flag(i) +0 

comment: Path Compression 
4 for all ieS do D(i) <--C(i) 
5 for log n interations do 

~ for all ieS do C(i) <--C(C(i)) 7 
6a for all ieS do D(i)<-- Min{C(i), D(C(i))} 
6b for all i, 0 <i <n do D(i) <--D(D(i)) 

comment: Clean Up (by column contraction) 
7a for all ieS do 

~ for all j S$ s.t. j=D(j) do 

~ ACi,j) <--OR{ACi,k) D(k)=j} 
keS 

7b for all jeS s.t. j=D(j) do 

~ for all ieS s.t. i=D(7) do 

A(i,j) «OR {A(k,j) [D(k)=i} 

KeS 
7c for all ieS do A(i,i) <--0 
8 for all ieS do if D(i)# i then Flag (i)<--0 


Figure | Algorithm MOD.CONNECT 


4-6 perform path compression and merge vertices 
which are known to be in the same connected com- 
ponent into a single ''supervertex''. Steps 7 and 
8 eliminate the merged vertices and store all the 
information about their edges into the super- 
vertices. In succeeding iterations, S contains 
the indices of the supervertices and the whole 
process is repeated on the graph represented by 
the adjacency matrix A restricted on S. Super- 
vertices are merged to form super-Supervertices, 


and so on. This merging process is repeated log 
n times until each connected component is repre- 
sented by a single vertex. Array D contains the 
information about which vertices are in the same 
component and step 6b updates D(i) for all i, 
i.e. updates the supervertex into which i is 
merged. The main difference between algorithm 
MOD.CONNECT and algorithm CONNECT in [12] is the 
introduction of the vector Flag of length n, the 
set S and the clean-up steps (steps 7 and 8) in 
algorithm MOD.CONNECT. Flag(i) = 1 indicates 
that vertex i is a supervertex. Flag(i) = 0 in- 
dicates that vertex i has been merged into a 
supervertex or is an isolated supervertex and 
should not be used in subsequent iterations. 
Thus only those vertices with Flag(i) = 1 (or in 
S) are involved in any given iteration. 


In order to visualize how the algorithm works, 
an informal description of the set S$ and the ar- 
rays A,D, Flag and C during each iteration are © 
given as follows: 


"Flag'' - a boolean vector of length n. 
Flag (i) = 1 iff vertex i is a super- 
vertex representing the group of vert- 
ices being merged to it. Vertex i is 
always the smallest numbered vertex of 
the supervertices. 


nS - contains the indices of the supervert- 
ices. 

Hip - a vector of length n. D(i) specifies the 
sSupervertex into which vertex i is 
merged. 

nicks - a vector of length n. C(i) specifies the 


smallest supervertex to which super- 
vertex i is adjacent. 


nA - ann x n symetric boolean matrix. Usu- 
ally we are only interested in the re- 
stricted A over the current S. 
A(i.j) = 1 if and only if there is an 
edge connecting supervertex i and super- 
vertex j. 


Algorithm MOD.CONNECT is a modified version of 
the algorithm CONNECT given in [12], a more de- 
tailed proof for the correctness of the algorithm 
can be found there. 


Since the number of flagged vertices (the num- 
ber of elements in S) is reduced by a factor of 
at least two after each iteration, we shall show. 
that by the technique of problem decomposition 
[13] the same time bound 0(log2n) can still be 
achieved by using less than n [n/log n] proces- 
sors. 


The reduction on the number of processors is 
based on the fact that certain operations have to 
be performed on the set S of the supervertices 
and not on all the vertices. However, in order 
for the processors to set themselves up so as 
they know which vertices are in S and be selected 
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to perform the various operations, an array, say 
Q,; can be set up such that Q(0),Q(1), : 
Q(M-1) represents the elements in S, where m=|$|. 
Q(j) for j runs from 1 to m would be used to re- 
place the condition ''for all jies'' in the algor- 
ithm. The array Q and m can be updated at each 
iteration in O(log n) steps by applying the fast 
parallel sorting algorithm described in [11,17] 
on the array Flag. Thus, step 2a in algorithm 
MOD.CONNECT can be replaced by calling the sort- 
ing procedure in [11,17] as 

2a  SORT(FLAG,Q,n) 
Procedure SORT sorts the input binary array Flag 
in time O(log n) with n processors, returns array 
Q with the property that Flag (Q(j)) = 1 for 
O <j <m and O elsewhere, where m is the number of 
I's in Flag (i.e. m= |S|). Besides the above 
changes, steps 2(b) and 7 are required to be con- 
sidered accordingly based on array Q. 


The following lemmas are useful in proving our 


results. 


Lemma 1: Given n elements {ag,a],..-, ap-j}and K 
processors, A(n) = ap*a;*...*aq-] can be computed 
in T time units, where * is any associative bin- 
ary Operation and i 


[n/K 1 - 1 + log K if [n/2 | >K 
T= . 
log n if Ln/2 | <k 
Proof: If K= In/2 J, it has been shown that A(n) 


can be computed in log n time units by the tech- 
nique of recursive doubling [9,22]. If. K <Ln/2 J 
we partition {ag,a], ...,an-j} into K groups, 
each of [n/K] elements, except that the last 
group has r = n - (K-1)[n/K] elements. Assign 
one processor to each group and then compute the 
groups in parallel. This takes [n/K]- 1 time 
units. These K results, one from each group, are 
then combined in parallel by the K processors, 
which takes another log K time units. Hence, the 
total time requirement is [n/K1- 1 + log K time 
units. 

[] 


Lemma 2: Let the n elements be partitioned into 
p sets and assume K processors available. The p 
products, one for each set, can be computed in at 
most T time units, where T is the same as given 

in Lemma 1 and * is an associative binary opera- 
tion. 7 


Proof: Align the p sets of elements as shown in 
| Figure 2 and partition the elements into K groups 
as in the proof of .Lemma I. 


| set | set 2 set 3 spt.p 
Roe | Xe eK KKK OK RKRO KK KTR wee | KKR TK. 
sages ees esckees| --=- k----> | eee 
gr. 1 2 3 4 K 
where k = [n/Kl elements 
r =n - (K-1) [n/Kl elements 
Figure 2: Partition of p sets into K groups 
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Assign one processor to each of the K groups to 
compute the products in that group. If all the 
elements in a group belong to the same set, one 
answer will result from that group. If the ele- 
ments in a group belong to several sets, say b 
sets, then b answers, one for each set, will be 
obtained. If b >2, at least b - 2 answers are 
final products and at most 2 answers in each 
group will be combined with answers in other 
groups to give a final product (the first and 
last groups each contribute at most one answer and 
the final product). For instance, (see Figure 2) 
groups | and 2 have one answer, group 3 has 2 an- 
swers and group 4 has 3 answers, one of them be- 
ing a final product. It is obvious that no more © 
than [n/K] - 1 time units are needed for comput- 
ing answers in each group. 


‘Let us assume that nj answers will be combined 
7 give the produce of set i. (In figure 2, 
= 3, n2 = 2, and nz = 1. ) Assign Ln;/2J 
es to each set to compute the product of 
that set. Since XP_,[ (nj-1)/21<K, the total 
number of pnocessors required will be less than K. 
Each set will take another log n; <log K time 
units to obtain the final product. Thus the tot- 
al time requirement is still no more than T as 
given in Lemma 1. 
] 


Lemmas | and 2 give an upper bound on the paral- 


lel time complexity for computing a product of n 


elements and products of sets of n elements. As 
a matter of fact, it can be shown easily that this 
bound is at most one time unit from optimal [16]. 
Since the ''Min'' operation in step 2b and the ''OR" 
operation in step 7 are associative binary opera- 
tions, Lemmas | and 2 give an upper bound on the 
total number of time units spent in these steps. 


Lemma 3: Given nK processors, step 2b in algor- 


ithm MOD.CONNECT takes at most 0(n/K+log nlog K) 


an if 1 <K <En/2 land 0(log2n) time if 
> la/2 J 


Proof: As mentioned earlier or from [10,12], 
further iterations of steps 2-7 merge supervert- 
ices. Step 8 eliminates those merged vertices 
which are no longer supervertices. It is proved 
in [12] that the number of supervertices (flagged 
elements) i.e. |S|, in each connected component 
decreased by a factor of at least two after each 
iteration until the connected component is repre- 
sented by a single supervertex. Moreover, if the 
whole connected component has merged to a single 
supervertex (i.e. the supervertex will be isolat-— 
ed), that supervertex will not be considered in 
the succeeding iterations since its flag is set to 
zero at step 3 in the iteration at which it be- 
comes isolated. Thus, we have n flagged elements 
at the first iteration (i.e. m=n) and have at most 
Ln/2i] flagged elements after i iterations. At 
step 2, in order to compute all C(i), K processors 
are assigned to each i to compute the minimum val- 
ue among at most|S| elements. Since ''Min'' is an 
associative binary operation, we can apply Lemma | 
to evaluate the time complexity. 


The program for step 2b can be described as 
follows: 


The following steps are performed in paral- 


2b 
lel for 0 si< m,Osj<K, since m <n, the max- 
imum number of processors is nK. It is as- 
sumed that M=[m/K]1. m=|S| and 
= {Q(0),0,(1),...,Q(m-1)} after step 2a. 
(1) for k<-- 0 until M-1 do 
~ for all i,j i,j. do 
if (A(Q(i), Oj M+k) )=1 AND 
Flag (Q(jM+k))=1) then 
Temp (i, jM+k) <--.Q(jMt+k) 
else Temp(i, jMtk) <-—-- y 
(2) for k<--l until M-1 do 
~ for all i,j. do 
~ Temp(i,jM) <-- min{Temp(i,jM), 
Temp(i,jM+k) } 
(3) for k<-- 0 until (log K)-1 do 
~ for all 7,j, do 
Temp(i,jM) <-- min {Temp(i,jM), 
Temp (i, ((j+2‘)mod K)M) } 
(4) for all i do 


if Temp(i,0) = y. then C(Q(i)) <--Q(i) 
else C(Q(i)) <-- Temp(7,0) 


In the above program, yY stands for any number 
exceeding n-I. In step (1), the elements whose 
minimum is to be computed are stored in the array 
Temp. In step (2) the minimum values for all the 
groups (the number of groups < K and the size of 
each group<M) are found in time O(m/K) via seq- 
uential search and all these groups are processed 
in parallel. Then, the overall minimum of the 
K minima is found (step (3)) in time O(log K) 
using at most mK processors at each step. The 
details for the time complexity are as follows: 


Case 1: I<K<f[n/2]l, since |S | is reduced by at 
least half after each iteration, | S| is at most 
2K after t = log n - [log KI iterations. Thus, 
we have the following time bound, T. 


~p (hl n/28J)/KT -1+1ogk) +209 7! log (n/2‘) 


"er an/K + tlog K + (log (2. 
<0(n/K + log nlog K) 


THE) 


Case 2: K >Ln/2J, we have 
T =rlog Sieg (ns2") = 0(10g“n) 

K=0 [J 
Lemma 4: Given nK processors, step 7 in algor- 


ithm MOD.CONNECT takes at most O(n/K + log“n) 
time units if 1 <K<Ln/2J and O(log4n) if 
K 2Ln/2J, 


Proof: After sets of supervertices are merged in 
steps 4-6, the adjacency information among the 
Supervertices is updated in step 7. Basically, 
step 7a puts an arc from vertex i to new super- 
vertex j (i.e., A(i,j)=1) if there is an edge 
between i and a vertex merged into j. Step 7b 
puts an arc from supervertex i to supervertex j 


173 


if there is an arc to j from a vertex merged in- 
to i. In step 7a or 7b those columns or rows in 
the adjacency matrix A corresponding to those 
vertices which are merged to sSupervertex j or i, 
are ''OR''ed together to give the new column j or 
row i. Since ''OR'' is an associate binary opera- 
tion, Lemma 2 can be applied to derive the time 
bound for step 7. There are m rows in A which 
correspond to S and these rows of elements are 
handled in parallel. As in step 2b, K proces- 
sors are assigned to each row i to compute 
A(i,j) in step 7. 


Since the application of Lemma 2 assumes that 
the elements in the same set are grouped togeth- 
er, we have to apply the parallel sort algorithm 
in [12,17] on array D. As a consequence, all 
the elements which have the same value in the 
array D are grouped together by the following 
procedure call 


SORT (D-FLAG, Q,m) 


The input array is the inner product of the ar- 
rays D and Flag (i.e. the i'th element equals 
D(i) if Flag(i)=1 otherwise 0). Array Flag is 
used such that only those elements corresponding 
to the supervertices are considered. Since the 
information corresponding .to the isolated super- 
vertices need not be merged with any other super- 
vertices, their corresponding Flag values have 
been assigned to 0 in step 3 and they will ef- 
fectively be ignored. Procedure SORT basically 
arranges the supervertices according to their 
values in D and all those elements with Flag(i)= 
O are put at the end of the list. After the pro- 
cedure call, array Q has the property that 
D(Q(j)) = D(Q(i)) if j >i and Flag(j)=Flag(i)=1. 
The program for step 7a (similarly for step 7b) 
can be described as follows: 


7a The following steps are performed in parallel 
for 0 <i <m, 0 <j <K. Since m <n, the max- 
imum number of processors required is nK. 
It is also assumed that M=[m/K 1. 
(1) SORT(D:Flag,Q , m) 
(2) for all j do Temp(j)<-- Q(jM) 
(3) for k<-- 1 until M-1 do 
for all i,j. do 
if D(Temp(j))=D(Q(jM+k) AND 
Flag(Q(jM+k))=1 then 
A(Q(i) ,Temp(j))<--OR{A(Q(i) ,Temp(j)), 
A( Qi), Q( jM+k) )} 
else Temp(j)<-- Q(jM+tk) 
(4) for all i,j do 


ae & p(tempG))= D(Q((j+1)M)) AND 

Flag(Q((j+1)M))=1 then 

A(Q(i) ,Temp(j))<-- ORTA(Q(i) ,Temp(j)), 
A( Qi), Q( (j+1)M))} 


(5) for k<--0 until (log K)-1 do. 
for all i,j do | 
if D (Temp(j)) = D(Temp({ (J#2k). mod 
~K)M)) then A(Q M3 TenpNDIsac OR 
{A(Q(i), Temp 


A(Q(i), "at (Ce26) mod k)M))} 


for all i,j do A(Temp(j) ,Q(i))<-- 


~A(Q(i) , Temp(j)) 


As in the proof of Lemma 2, the elements are 
partitioned into K groups each of which has M 
elements. In step (1) the elements are stably 
sorted such that all the elements with the same 
value in D (the same D-value) are grouped to- 
gether (this refers to those elements which wil] 
later be merged together). The first element in 
each group is assigned to the array Temp in step 
(2). The columns of A with the same D-value in 
each group are ''OR''ed together sequentially in 
step (3). The resultant column is stored at 
A(*,Temp(j)), where Temp(j) always remembers the 
index of the first column in the j'th group among 
all the columns which have the same D-value.* In 
step (4), the two resultant columns which have the 
same D-value in two adjacent groups are ''OR"ed to- 
gether. In step (5), all the resultant columns 
with the same D-values are ''OR''ed together and the 
final resultant column is stored in the smallest 
indexed resultant column. Since step (1) evokes 
a stable sort [11,16], it is easy to show that the 
smallest numbered column, say j, has the property 
that j = D(j), (i.e. it will become the supervert- 
ex in the later iterations). 


(6) 


During the first iteration, the K processors of 
one row must deal with n elements; and for each 
succeeding iteration, the number of elements to 
be dealt with by the K processors is at most half 
of the number in the previous iteration. Thus, 
using Lemma 2 and applying the same kind of analy- 
sis as in the proof of Lemma 3, we derive the time 
bound T as stated in the lemma. 

f] 


Theorem: Algorithm MOD.CONNECT finds the connec- 
ted components of an undirected graph with n vert- 
ices in time 0(n/K + log2n) using nK processors 
where K2 1. 


Proof: The time and processor requirements are as 
follows: 
Step Total Time Processors 
1 <K<In/Z] K2 In/Z 
] 0(1) 0(1) n 
2a 0(log2n) 0(log2n) n 
2b 0(n/K+(log n) 
| (log K)) 0(log@ n) | nk 
3 O(log n) O(log n) n 
in 0( logan) O(log n) n 
5 0(log2n) 0(log2n). n 
6 O(log n) 0(loggn) n 
7 0(n/K+log2n) 0(log2n) nk 
8 O(log n) O(log n) n 


Thus, nK processors suffice to determine the 
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directed graph in time O0(log2n). 


connected components of an undirected graph with 
n vertices in time O(n/K + logn). 


[] 


As a by-product of our main theorem, we have 
the following result. | ‘ 


Corollary: Given nf n/log“rl : processors, algori- 
thm MOD.CONNECT determines the connected compon- 
ents of an_undirected graph with n vertices in 
time O(log“n) . 


This method uses the least number of processors 
yet to find the connected components of an un- 
The previous 
method [12] needs nfn/log nj] processors to ach- 
ieve the same time bound. It can be shown easily 
that if K = 1, i.e. n processors are available, 
algorithm MOD.CONNECT takes O(n) time. If less 
than n processors are available (i.e. K <1), each 
parallel step will be repeated [1/K] times and 
the total required time will be O([n/K]). Asa 
matter of fact, this algorithm takes O(n2) time 
with 1 processor (i.e. K = 1/n) and also, this 
algorithm is optimal in the sense that the speed- 
up ratio is linear with the number of processors 
available as long as the total number of proces- 
sors iS no more than n [n/log“nl . 


IV. CONCLUSION 
We have proposed algorithm MOD.CONNECT to find 
the connected components of an undirected graph 
and have derived a time bound for the algorithm 
using a fixed number of available processors. 
can also show that several related problems can 
be solved in the same time and processor bounds 
[4]. In particular, these problems can be solved 
in O(log2n) time using nfn/log¢nl processors. 
This method is superior to the previous methods 
[19,20] because it uses the least number of pro- 
cessors for the same time bound. The technique 
employed in our algorithm is a kind of problem 
decomposition which is similar to what is used in 
[19] for finding the miminum element in an array 
of n elements. It exploits the property that the 
problem size is reduced by at least half after 
each iteration and thus the processor requirement 
can be reduced by a factor of log n over existing 
algorithms. However, other problems, such as 
finding the transitive closure of an asymmetric 
Boolean matrix and the strongly connected compon- 
ents of a directed graph, can be shown to be re- 
ducible to the matrix multiplication problem 
[3,18], cha time complexity is 0(n2*°!log n/k) 
using K <n2-9l/log n processors with $,=0(K/log n) 
Since the Bigs of the problem remains constant 
after each iteration, the idea of reducing the 
number of processors by a factor of log n is not 
directly applicable. It remains an open problem 
to determine whether there exist algorithms for 
these problems whose speedup ratios are. | inear 
with respect to the number of processors avail- 
able. 


We 
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SPEEDUP BOUNDS FOR CONTINUOUS SYSTEM SIMULATION 
ON A HOMOGENEOUS MULTIPROCESSOR 


E.H. D'Hollander 
State University of Ghent 
Department of Applied Mathematics — 
Coupure Links 533. | 
B-9000 Ghent, BELGIUM 


Abstract -- This paper explores the benifits 
and the bounds of multiprocessors for the simula- 
tion of continuous systems. Different types of 
parallelism are defined describing the stepwise 
refinement of a problem into parallel executable 
tasks. Invariant simulation systems have a great 
parallelism in time, due to their periodic execu- 
tion for each integration step. When a problem can 
be partitioned into tasks which are independently 
scheduled, it has natural parallelism. A problem 
structure having precedence constraints among 
tasks exhibits functional parallelism and finally 
a task which further is split-up in atomic opera- 
tions exploits the operator parallelism. For each 
of these forms the processor utilization and speed- 
up bounds are analysed with respect to the struc- 
tural characteristics of the simulation problem. 


1. Introduction 


The idea to apply multiprocessor systems in 
the domain of continuous system simulation is mo- 
tivated mainly by the following considerations. 
First, the need for fast simulation power is re- 
cognized in many applications, but it is most 
stressed in the field of interactive simulation 
and in real time systems. Second, most digital 
simulation is cpu-bound, since numerical integra- 
tion of a complex set of differential equations 
is calculation-intensive. Because of the heavy 
cpu-load, the use of several processing units is 
likely to produce a faster solution. However, the 
final speedup is bound by the processor-system as 
well as the problem characteristics. The aim of 
the following sections will be to focus on the 
problem dependent characteristics which influence 
the potential speedup on a MIMD-machine. After a 
general problem formulation in section 2, the dif- 
ferent types of parallelism will be defined in 
section 3. In section 4 the problem-dependent fac- 
tors, limiting the unconstrained use of parallelism 
are discussed and some useful bounds on cpu-utili- 
zation will be derived. Attention is given to the 
architectural aspects where they might constitute 
a potential bottleneck. 


2. Problem Definition 


A simulation model S, 
following set (Fig. 2.1) 
~- the time, t ; 
- the input-set, x ; 
- the state variables, gq ; 
- the output-variables, y ; 
- the derivative functions, f ; 
~ the output functions, g. 
We consider a general multiprocessor MP, con- 
sisting of : 
- n identical processors ; 
- m memory-modules 3; 


is described by the 
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- an interconnection system I, which describes 
the coupling between processors and memories. 
A multiprocessor is coded by the software P, 
yielding a programmed computer system, K : 


K = {n,m,1I,P} (2.1) 


When a computer system K solves the simulation 
problem S, the code for executing S is partition- 
ed and allocated to the different processors by 
the mapping T 


P = 1(S) (2.2) 


A programmed multiprocessorsystem K, has an exe- 
cution time, t. which is function of 


- the machine dependent characteristics, n,m,I; 
- the implementation of the problem, 1(S) 


t, = F[n,m,I; (8) ] (2.3) 


e 
The minimization of t, therefore depends on 
machine- and on problem-characteristics. Whereas 
the architectural aspects have received ample 
considerations in the literature, especially for 
the possible interconnection structures, this 
paper contributes to the equally important parti- 
tioning problem. From the results obtained, it 
should be possible to select the proper n, m and 
I, in order to tailor the multiprocessor to the 


type of problems it will solve. 


t 
> y=g(q) 
x ——>}. 
Fig. 2.1. The continuous system S = {t,x,q,y,f,g} 


3. Types of parallelism 


Se Partitioning steps 


Fen Renee OI fe OLY SNS CUED wemle LENS CAND ED oe ee wae cee 


The partitioning, P = 1(S). proceeds in two 
distinct steps (Fig. 3.1). 


Ty | up) 


§ ———> C= {J a eas P 
Figs 3.1. Silden seston (Ss), Task eyacem (C) 
and Processor-coding (P) | 


In the first pass 1,, the model S is transformed 
into a task system C = {J, <}. A task system con- 
sists of a set of tasks, J = {T. } subject to an 
ordering <-. This ordering inditates the ‘prece- 
dence constraints governing the execution of J. 

A task T;, operating on the results of task Tj, 
requires the prior termination of T; before task 
Tj can initiate. This execution ordering is 


denoted by T, < T.. In the second step 1, the 
task system C is scheduled and programmed on the 
available processors, by the coding P. This im- 
plies compilation and downloading of the tasks T;, 
together with the necessary synchronization primi- 
tives. The scheduling strategy has to take into 
account the precedence constraints of the task 
system, the duration of each task, and eventually 
the parallelism within a task T;. Given an un- 
supervised scheduling algorithm, step tm) is total- 
ly transparent to the user, whereas normally a 
limited user-assisted partitioning occurs in step 
™,- Since the number of processors n, is only in- 
troduced in the unsupervised partitioning step 19 
this approach permits a graceful degradation. 


3.2. Parallelism in time 


Whenever a simulation problem S has a deter- 
ministic structure, its task system will be iden- 
tical for each time step. Consequently, the sche- 
duling and compilation is the same for all inte- 
gration steps and needs to be done only once. 

On the other hand, if the structure of the problem 
varies according to state changes during execution, 
the parallel implementation requires a reschedul- 
ing of the task system, in order to account for 
any variations in the ordering <*, the duration 
times or the task set J. The influence of these 
structure-variations is estimated by the 'paral- 
lelism in time’. The time-parallelism is defined 
as the average number of integration steps during 
which the problem structure is fixed. A high pa- 
rallelism in time justifies an elaborated optimi- 
zation of the scheduling strategy. 


Each integration step of a set of differen- 

tial equations requires successively : 

1) the numerical integration of the state vec-— 

tor (q) 
2) the calculation of the derivative functions 
4 = £(q,x,t). 

The corresponding set of tasks J, generally can 
be partitioned into several subsets B: of tasks 
T; > which do not interact during the execution of 
an integration step, i.e. : C = {B,O} with B={B.} 
Consequently, all sets B; can run concurrently 
without synchronization. This form of parallelism 
is termed 'natural', and it is quantified by the 
number of subsets inB. Many simulation systems 
exhibit a natural parallelism, since they can 
split up in logically independent subsystems. 
‘A prominent example of natural parallelism is 


given by the independent state equations (Fig.2.1). 


First all state variables can be integrated in 
parallel. Then the state vector is communicated to 
the different processors and finally each deriva- 
tive function is evaluated simultaneously [7 ]. 


When there exists precedence constraints be- 
tween the tasks of a simulation system or subsys- 
tem, the ordering relation <- is not empty. The 
parallelism which respects this ordering < is 
termed 'functional'. It is not possible to extract 
this functional parallelism immediately from the 
problem description S, or during phase 1], where 
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the task system C is created. It has to be recog- 
nized in the phase To, and the partitioning algo- 
rithm has to take into account the absolute prio- 
rities between executing tasks, the task duration 
times and the communication~- and synchronization- 
overhead. Typically this parallelism is applied 
to the parallel execution of derivative function 
calculations. 


A fourth form of parallelism, also occuring 
in the partitioning phase t>, arises when a task 
T., 1s further split up, in order to increase 
parallel execution. The operator parallelism gives 
rise to two subforms the micro- and the macro- 
operator-parallelism, depending on its effect on 
the structure of the whole task system C. In the 
micro-form, the task T. is searched for parallel 
executable basic-operators, such as multiplication 
and addition ; the structure of the task system is 
then changed only locally [17],[18],[19]. The 
macro-form has a profound effect on the global 
task system structure. There the splitting of ope- 
rator T; results in a split-up of the whole task 
system. This happens for certain parallel integra- 
tion algorithms, e.g. when the integration formu- 
las allow the system to be evaluated simultaneous— 
ly for two consecutive timesteps [7 ],[21],[23]. 


4. Problem-dependent performance bounds 
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Clearly the major advantage of most simula- 
tion systems with respect to parallel processing, 
is the repetitive execution of the same calcula- 
tions during each timestep. Ideally this requires 
that task duration times and problem structure 
(i.e. the precedence constraints) remain invariant 
during the whole integration interval. The condi- 
tion of a time-invariant structure, however, can 
be relaxed to include continuous systems which 
switch over during execution time between a limit- 
ed number of alternating structures. In this case 
all possible structures are partitioned in advance 
and downloaded into the different processor-memo- 
ries. In this way the jump to a new structure, even 
as a result of a state-change in the model, can be 
realized with minimal overhead, similar to a sub- 
routine-call in sequential processing. However, 
this method fails in two cases. First the 'context- 
switching’ becomes predominant whenever the model 
rapidly alternates between several different struc-— 
tures, i.e. when the time-parallelism is low. 
Second, the number of possible structures grows 
exponentially with the number of 'switchpoints’ 
in the model : these are’ the points where a selec- 
tion is made between alternate functions to evalu- 
ate (compare with the switches in an analog block- 
scheme). Few analytical results exist on the in- 
fluence of variable task duration times. Several 
simulation results, however, seem to indicate that 
slight variations on the estimated task length 
have only a marginal effect on the scheduling effi- 
ciency [1 ]. Task lengths can be estimated at com- 
pile time from the duration of the individual in- 
structions [20]. 


The general assignment problem can be stated 
as follows. We are given : | 
1) a task system C = {J,<} in which 
J = {T,,..., Tw} equals a set of tasks and 
< is the partial ordering relation ; 
Tj < Tz denotes that T; cannot start execu- 
tion prior to the complétion of T; ; 


2) a weighting function a(T.), representing the 


execution time : ee a. ( 3) $. 


3) a fixed number of identical processors, n. 


The objective is to find a partition A,, ..., 
A of J, such that the largest execution time on 
any processor 


T.€A, 
J 1 


max 


} (4.1) 


max { Te 
Vi J 
is minimized, subject to the precedence con- 
straints, <. 

It is well known that for general values of n 
and m, this problem is NP-complete [14],[22]. 
Therefore considerable attention has been given 
to the development of fast heuristics, yielding 
suboptimal results [1 ]1,[5 ],[9 ]. The general 
problem formulation above, involves the detection 
of natural parallelism (<-= 0) as well as func- 
tional parallelism (<< # 0). In the following 
paragraphs, two common heuristics for this parti- 
tioning problem will be analyzed. : 


Pad 
~— 


Natural parallelism (<“= 0). Im this case, 
the tasks are independent. Intuitively it seems 
useful to assign the longest tasks first. In this 
way the smaller tasks are reserved for the end, ° 
and can be used to reduce the irregularities of 
the distribution. This leads to the ‘Longest Pro- 
cess Time' algorithm [4 ] | 

1) arrange the tasks in a list, in decreasing 
order of execution times ; 
2) assign each task from this list consecutive- 
ly to the first available processor. 
This is a so-called 'list-algorithm'. The list al- 
gorithms differ only in the way a list of tasks is 
arranged. In contrast to an optimal search by 
enumerative techniques, the LPT-algorithm is rela- 
tively efficient. The sorting of the list takes 
O [N.log(N) ] steps, and the assignment phase re- 
quires O [N(n-1)/2 ] operations. For large N and 
constant number of processors n, the algorithm is 
of order O[N.log(N) ]. 


Functional parallelism (<# 0). The ordering 
relation <-, governing the execution priority of 
the tasks, is represented by a task graph. This is 
the tupple [J, oa(T), <- ] and consists of vertices 
denoting tasks, and edges denoting the precedence 
constraints. Again a list algorithm is applied for 
the automatic scheduling. In order to account for 
the precedence relations however, the weight. of a 
task Tj; is measured by its level. A task T; has 
level &, when the longest path from that task to 
a terminal task requires 2 time-units. Consequent- 
ly, &(T;) is the minimal time needed to terminate 
the execution of the task graph, from the beginn- 
ing of task T;. It is intuitively appealing to 
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- tem. 


assign the highest priority of execution to the 
tasks with highest level, i.e. to those tasks 
with the largest workload ahead. This leads to 
the following ‘level algorithm’, which originates 
from the optimization of assembly lines [13]: 

1) arrange the tasks by decreasing levels ; 

2) whenever a processor becomes free, assign 
that task of which all predecessors are 
executed, and which has the highest level 
of the remaining tasks. Ties are arbitrari- 
ly resolved. 

The workload of this algorithm depends on the num- 
ber of tasks N, the average number of predecessors 
of each task, Npred, and the number of processors 
n ; for moderate values of Npred and n, however, 
the algorithm is O[N’/2 ]. 


It is not a rule that a multiprocessor of n 
identical processors will perform n times faster 
than a single processor. Although several archi- 
tectures bear this potential, even the best equip- 
ped systems will be more or less seriously limit- 
ed by the constraints of the problem. It is the 
aim of this section to derive the lower and upper 
bounds of processor utilization with respect to 
the problem characteristics. Several authors have 
demonstrated the suboptimality of the LPT- and 
level-algorithms [11],[13],[15]. Here we concen- 
trate on the suboptimality conditions for the 
processor utilization taking into account algo- ~ 
rithmic-, problem- and processor-characteristics. 
We define the following performance measures. 

The effective execution time te, is the process- 
ing time of the longest operating processor. 


The minimal execution time tpjn, of any problem 
on an n-processor system is ; 


= E/n 


ic 


1 % 
The processor utilization U is defined as the 
average fraction of time that the processors are 
busy during te 


ei eae /t 


t (4.2) 


min 


with E the total workload of the task sys- 


min e | (4.3) 
The speedup S of a n-processor system over a uni- 
processor is S = U.n. 


Upper bounds. The effective calculation time 
of a task system is bound below both by the long- 
est chain of tasks to be executed serially and by 


the number of processors n, i.e. the degree of 


hardware parallelism. When there are no precedence 
constraints and we do not allow pre-emption, one 


has 


=  < 
max ce E/n} for : 0 


t i 

e,min 
with T the longest task duration. In the case 

max a : ; : 
of precedence constraints, the effective execution 
time is bounded below by the longest path in the 
task graph. According to the previous definition, 
this is the highest task-level L : 


= max {L, E/n} for @ #0 


t é 
e,min 


For obvious reasons it is assumed that the number 


of tasks exceeds the number of processors, l.e. When the number of tasks N, grows indefinitely, 


N2n. Define the average task length Tt = E/N. and the mean task length T exceeds an arbitrary 
Then the processor utilization and speedup have value e > 0, (4.9) yields lim t =t.. . 
the following upper bounds : . Neinf LET ar 
Therefore, the LPT~algorithm 18 asymptotically 
U.., Stine Tk oR SSO a OPE 
max tet 
max 
. . ‘Let 
Snax 7 min es » n} “oot 
max 
and 
+, + 
Unay 7 min {= , 1) for<#0 (4.5) ene 
; E 
max ~ min {- » ti) 
Lower bounds. First we consider the indepen- 
dent task system (<* = 0). Denote by t; the start- +. 146+81 
ing time of task. T.. The minimal execution time Graham 
tnin given by (4.2°) requires that all processors 
remain active during the interval [0,t,;, ]. 
In each of the list-algorithms, the last task Ty 
1s started on the first available processor, 
at t Thus, till t, all processors are busy exe- 
eutane the previous N-1 tasks, yielding an upper 
bound for ty : | +. 1GE+@1 —+ “opt 
ea +. 1@E+@1 +, S@E+@1 +, 9@E+@1 
< ( ) t.)/n (4.6) Fig. 4.1. Comparison of the execution-time bounds 
i-l 7 for independent-task systems (<“= 0), 
Let tg anaes qT, be the last tasks executed on sencaunee by Ene Titalsorrehm 
each of the n mrOeenaons: These tasks start ulti- With, U.S to 3. if topt» the lower bound for 
mately at ty. From that moment the execution will the processor utilization is given by the inverse 
not last longer than the maximal duration of these of the upper limit in (4.9) and 
n tasks, Eine with Tb) max > max Ty « NT) ax 
: ? jzl,n “jj S. =U..n=n/ {1 + ——  } 
This gives an upper bound for the execution time ee ee 
of each list-algorithm : t. < ty * Ty, thax’ Taking for <= 0. 
into account inequalities (4. 6) and (4. Z)5 For the general case where <*# 0, the long- 
Caen > ty and the execution time of the LPT-sche- est path, L, defines the minimal execution time 
dule is bounded by te, ae which is independent of the number of pro- 
: <t re (4.7) cessors n. When there is no bound on the number 
LPT min & max of processors, all other tasks can run concurrent-— 


ly with the longest chain tasks, yielding a total 


monstrati = le al termi- : : 
SeHQUAE ES Ene PHA ObNe tar ccuceus eeu aye Serle execution time L. In the worst case, however, 


nates: Wace Te max Ob ERE a peorute minima oP due to precedence constraints, no tasks can be 
cution time. With topt > tins equation (4.7) also executed in parallel with the longest path. In 
yields a suboptimality bound for the algorithm : order to be consistent with the definition of 
tier ee longest path, this requires that the remaining 
——— S£] + —— (4.8) tasks can be executed in zero time, by an even 
opt opt — partitioning over an infinite number of processors. 


This collection of remaining tasks is called an 
‘impulse task'. Consider a task system having a 
total task duration E = 1, and a longest path L <I. 


This bound is comparable with the Graham bound 


[11] : 
t 


Lpr < 4 ee In the worst case, this problem requires the exe- 
t 3. 3n cution of an impulse-task of I-L time units, after 
opt , ; 
the execution of the longest path. Since an impul- 
Both bounds are represented in Fig. 4.1, for large se-task can be divided evenly over n processors, 
values of n (n > 10) and with the normalization this gives an additional workload of 6 = (1-L)/n 
T,) = 1. This figure illustrates that for time units (Fig. 4.2). The total execution time 
sac is te =L+ 6. With t, 1/n, the processor uti- 
Sc ~] juati : e = , 
copt 3n/(n~1) ve ymax? equation (4.8) gives a lization of the woret" eee U = /tmin/te> gives the 
lower bound. Moreover, (4.7) can be written as : lower bound 
t n. T uU. #1/[1 + (n-1)L J]. 
LPT < 1+ mes (4.9) min 
min Ne This lower bound is also related to the mean width 


a, 


P2 
ees i 


PI 


6) L Les 


Fig. 4.2. Worst case task system subject to prece- 
dence constraints (<*# 0) 
U 

of the task graph, W, which is defined as follows. 
Suppose the task graph is executed on an unlimited 
number of processors. At each instance of time, t, 
t © [0,L], there are w(t) processors active, 

w(t) equals the number of parallel executed tasks 

at time t, and is conveniently called the width of 
the task graph. The mean width, W, is now defined: 


ob 
W=— f w(t) dt 
u 0) 


The integral value represents the total active 
processor time, which clearly equals the total 
task duration time E, so W = E/L and consequently 
with E = l, | 

. = 1/1 + (n- 

ae 1/[1 + (@-1)/W ] 
This is the best possible bound, since one always 
can construct a taskgraph with longest path L, 
giving minimal utilization on n processors. 
The lower bound for speedup becomes :; 
n-1 


S.. =n/{l+—-——] 


<e 
min W ao 7 0 


4.4. Parallelism and Communication 


The degree of parallelism and the degree of 
communication are strongly interconnected. There- 
fore, the interconnection network - mainly used 
for traffic between processors and memories — 
should be tuned to the type of parallelism which 
is exploited. Parallel program execution can be 
static or dynamic. Analytical and simulation stu- 
dies [3 ],[12] have demonstrated that simultaneous 
execution of programs in an n processor, n memory 
system, coupled through a crossbar switch, can re- 
sult in a significant loss of efficiency when pro- 
cessors randomly access memories other than their 
preferred memories for the execution of instruc— 
tions. Parallel continuous system simulation, how- 
ever, mainly involves the repetitive execution of 
static programs which are assigned permanently to 
the same processors. Therefore, the programs can 
be stored in private memories, one for each pro- 
cessor. Using private memories for program execu- 
tion, the total bandwith of the interconnection 
system becomes available for data communication. 
Memory access can be scheduled or arrive at random. 
Some authors suggest the scheduling of tasks of a 
highly structured taskgraph could take into ac- 
count the possibly hierarchically structured com- 
munication paths, in order to program highly 
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interactive tasks on 'nearby’-processors [2 ]. 
However, for general simulation problems this 
complicates unnecessarily the scheduling algorithm. 
Indeed, communication conflicts should be more an 
exception than a rule, so their minimization 
through a well-balanced partitioning will normal- 
ly have only a marginal effect. In fact, the hard- 
ware interconnections have to provide the requir- 
ed support for the statistically expected traffic 
load within reasonable efficiency bounds. There- 
fore, a homogeneous multiprocessor system with 

a non-hierarchical bus-structure is considered. 

In order to estimate the impact of the intercon- 
nection structure, a single bus-structure queu- 
ing model is considered as an example (Fig. 4.3). 


Q 1 
‘Ss 


1/A 


Fig. 4.3. Queuing model of an n-processor, 
1 shared memory architecture 


The model consists of n identical processors, re- 
questing information from the shared data-memory, 
which is the server. The service time 1/y, depends 
on the number of variables which are written into 
or read from the memory. These are the numbers of 
input variables of a task plus one output variable 
1.e. typically O - 10 memory cycles. The system 

is self regulating (closed loop), since request- 
ing processors in the queue become non-active un- 
til they receive service. Since a processor re- 
quests service at the end of each task, the mean 
request-interarrival time, 1/A equals the mean 
task-execution time. Assuming exponential service 
and interarrival times, this queuing model yields 
an estimate for the effective response time R, 
which is function of A and up [16] : 


Rt Ne 
R = wI=p,) 1/r (4.10) 
with 
7 n n! =| 


the probability that the shared memory stays idle, 
and : | 


_ average communication time per task 


mean task duration 


The impact of bus conflicts on the effective 
speedup of n parallel operating processors is | 
given by the efficiency factor 

I/k + 1/u (1p) (1-p,) 


‘bus 1/A +R nn. 


and which is shown in Fig. 4.4 for various values 
of n. 


"bus | 
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Fig. 4.4. Bus efficiency in function of inter- 
task communication (9) and number of 
processors (n) 


The figure illustrates that low values of p (< .1) 
reduce the influence of bus-conflicts. In order to 
hold the queueing time below 10%, (npy, 7 -9)> 
critical values of o are given in table I. 


Table I. Maximal p-values for Nea > .9 


From this table a rule of thumb, p.n< 1! can be 
derived. This inequality stresses the bounds for 
parallel task execution in a one~bus interconnec- 
tion structure. Moreover the result is robust with 
respect to varying distributions of arrival and 
service times as is shown analytically [16] and 

by simulation [6]. From the rule it is possible 
to predict quantitatively the communication per- 
formance of a task system on a single bus-multi- 
processor in terms of the average duration and 
mean communication time of a task. The bound es- 
pecially applies to micro-operator-parallelism, 
since the duration time of basic operators such 

as addition and multiplication may be of the same 
order as the data-transfer time. Unfortunately, 
few publications take into account this communica- 
tion overhead, which may well exceed the effective 
calculation time [8 ]. It is noted that an m-port 
memory or m multiple memories interconnected by 

a crossbar reduce the bus-conflicts significantly, 
provided the memory accesses are spread equally 
over all communication paths by an appropriate 
partitioning of the shared variables over the 
available memories. Using the queuing network 
theory of Gordon and Newell [10] for exponential 
distributions one finds a similar result:p.n/m < 1 
[6 ]. In order to estimate the relative influence 
of communication and precedence constraints, we 
consider a unit task with longest path L = .05, 
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or mean taskgraph width W = 20. The speedup is 
the inverse of the execution time of a unit task. 
The maximal and minimal bounds on a single bus, 
n-processor system are shown in Fig. 4.5, using 
the values p = O (no communication overhead), 

0 = .025 and p = .05. Apparently bus conflicts 
cause a serious efficiency loss when n> 1/p. 

The maximal speedup is 20, ideally achieved with 
20 processors. Interestingly however, in the 
worst case a doubling of the processors allows 
the speedup to become 12, i.e. 60% of its maximun, 
provided the connection system is not saturated 
(p = <.025). 
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Fig. 4.5. Ideal and worst case performance of a 
taskgraph with mean width W = 20 
(longest path L = .05) on n processors, 
1 shared memory. 
Upper line : maximal speedup S. 
Lower lines : minimal speedup S with 
no (9 = 0), moderate (p = .025) and 
high (9 = .05) bus traffic. 


5. Conclusion 


Continuous simulation constitutes a fruitful 
application to parallel processing techniques, 
mainly because of its invariant and repetitive 
tasksystem, i.e. its parallel structure in time. 
The state equations are independent tasks that 
can be distributed evenly over the available pro- 
cessors by a simple, efficient and asymptotically 
optimal LPT-scheduling algorithm. More parallelism 
can be gained in building up a tasksystem of the 
derivative functions, thereby introducing prece- 
dence constraints between tasks. This functional 
parallelism can be scheduled transparently to the 
user by a numerically simple, yet powerful level 
algorithm. Minimal and maximal speedup bounds 
have been derived in function of the longest path 
L, or equivalently the mean task graph width W. 
Further refinement of tasks into basic operators 
raises the problem of communication overhead. 
Queuing analysis of a single bus interconnection 
between processors and a shared data memory re- 
veals that the duration of the average task should 
exceed n times its communication time, where n 
equals the number of active processors. This re- 
sult can be generalized for other interconnection 
structures by the theory of closed queuing net- 


works. 
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BEHAVIOR OF PARALLEL ALGORITHMS 
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Abstract -- A probabilistic model of a class 
of parallel programs is used to investigate the 
counterintuitive behavior observed for some 
parallel algorithms. Two main points are made: 
(1) It may, in general, be beneficial to consider 
using more logical processes than physical pro- 
cessors in a parallel algorithm; and (2) Results 
from order statistics are useful tools in analyz- 
ing parallel systems. 


1. Introduction 

Certain strange phenomena have been reported 
recently regarding the behavior of parallel 
algorithms on real multiprocessors, such as C.mmp 
and Cm* [2,6,8,10], and an interesting problem is 
the development of models and analytical tech- 
niques to explain them [8,9]. 
develop a realistic probabilistic model for 
describing how members of a class of "decompos- 
able" problems behave when solved by certain 
parallel algorithms. Our contribution is not 
simply in the explanation of observed phenomena, 
but also in the introduction of order statistics 
as an analytical tool not ordinarily used in 
performance evaluation of computer systems. 


Anomalous problems that have been reported 
in the literature can be classified into two 
categories. On the one hand are problems that 
are solved in parallel by decomposing them into 
a number of subproblems, the successful comple- 
tion of any one of which solves the original 
problem. For instance, suppose it is necessary 
to search a table for the occurrence of an item 
known to be in the table somewhere. One possible 
algorithm is to search the positions of the table 
in random order. A search by n such processes, 
operating independently and in parallel, can 
be conducted, and the first process that finds 
the item is the one that determines the total 
running time, 


Such an algorithm could, in theory at least, 
exhibit the following strange behavior. With one 
processor, the average solution time is Ty 3 with 


nm processors and n independent processes, the 
average solution time fs T,; and T < T,/n. In 


other words, n processors can exhibit a speed-up 
of the average time that is more than a factor 
of n. 


A plausible explanation of such an 
apparently unlikely phenomenon is that the al- 
gorithm's running time is a random variable, 
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having a distribution F(x), for which the 
expected value of the minimum of n observa- 
tions is less than 1/n times the expected value 
of a single observation. For instance, 


F(x) = ne 0<x<1,0< 6 < 1/2, has the 
required property for all n> 2. This phenom- 
enon is described in more detail in [8]. It 
should be noted that while such behavior is 
theoretically possible, we know of no practical 
algorithms for which it has been observed. 


We consider here a different problem, where 
the decomposition is into a number of subproblems, 
the successful completion of all of which is 
required to solve the original problem. An 
example (in fact, the one that motivated develop- 
ment of the model proposed here) is a discrete 
optimization problem, such as integer programming, 
in which the space of possible solutions is 
partitioned into n disjoint subsets that are 
searched in parallel for the optimum feasible 
solution. The curious behavior here seems more 
believable than that described above, but still 
not entirely intuitive. It has been observed 
that average running times can sometimes be 
reduced by partitioning into more subproblems 
than there are processors, and by sharing the 
processors among the active subproblem-solving 
processes [8]. 


In late 1975, when the C.mmp multiprocessor 
at Carnegie-Mellon University [10] was configured 
with 5 PDP-ll's sharing access to a single large 
memory, experiments with a parallel implementa- 
tion of an implicit enumeration algorithm for 
0/1 integer programming were conducted. In this 
problem, the goal is to i 
Co + 521 ie 

n 


j2a 945%5 > Pa 


minimize: 


subject to: ,>l<i<no 


x, {0,1}, 1<j<n, 


A complete enumeration of the 2° possible 
solution vectors can be avoided by making use 
of "branch~and-bound” techniques, but the 
general approach still looks much like a tree 
search: "branch" on x,, say, and solve the two 
subproblems (each with n-l variables) in which, 
respectively, x is replaced by 0 and by 1 in the 


original problem. The subproblems are smaller 
instances of the same type problem, and can be 
solved by further division. If r variables 

are chosen initially for branching, there are 2 
subproblems, each with n-r variables, and they 
can be solved independently and in parallel. 
practice, it improves the solution time if 


In 


certain global information (a bound on the 
objective function value) is shared, but there 
is no requirement for any interaction except for 
one final comparison of the subproblem's optimum 
solution value with the best solution found so 
far for another subproblem. 


The experiments on C.mmp were not intended 
to determine how much speed-up could be obtained 
by parallel decomposition, but rather to exer- 
cise the C.mmp hardware and software in the 
system's early days of operation. Consequently, 
few precise timing runs were made. The limited 
experience offered by the parallel integer 
programming algorithm and the timings observed 
for it indicated, however, that even with only 
5 processors, the average solution times tended 
to fall as the number of subproblems solved in 
parallel increased well beyond 5. 


Figure 1 shows the behavior observed for a 
typical 20 variable, 20 constraint problem. 
Several runs were made with the r initial 
branching variables chosen at random, for r 
from 0 to 4, and the solution times averaged. 
The fact that average solution times tended to 
fall, even with 8, 16, and 32 subproblems being 
solved in parallel on only 5 processors, led to 
speculation regarding the underlying reasons for 
this phenomenon and to development of the analyt- 
ical model described below. 


Explanation of this behavior in a fairly 
realistic model, accounting even for overhead 
associated with processor sharing, is the purpose 
of this paper. Section 2 describes the hardware, 
scheduling, and problem models. Section 3 
presents the analysis of the ideal (no over- 
head) case, and Section 4 extends it to include 
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FIGURE 1 - Average execution times for a 
typical 0/1 integer programming 
problem on C.mmp system 
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_scheduling overhead. 


We conclude in Section 5 
with a brief discussion of possible applications 
and future directions. | 
2. The Model 

A simple multiprocessor model is that shown 
in Figure 2. The k identical processors operate 
asynchronously and in parallel, and communicate 
via a shared memory. For simplicity we assume 
there is no contention for this memory (an 
assumption that is entirely reasonable for cer- 
tain hardware configurations and reference pro- 
perties), and that no overhead is involved in 
locking shared data for exclusive access. In 
short, each processor operates as fast as if it 
alone were executing without the other k-1 pro- 
cessors. This assumption is necessary to make 
the model at all tractable; analysis of conten- 
tion effects is a difficult problem in its own 
right. In addition, we assume that the problem 


we are solving accounts for the entire computing 


load on the system; more about this later. 


A schedulable entity is called a process. 
In our model, each subproblem solution is 
computed by a separate process, and each process 
is either active (i.e., not completed) or 
inactive. Whenever there are at most as many 
active processes as processors, each process is 
bound to one processor, and no scheduling is 
necessary. If there are more active processes 
than processors, scheduling is by processor 
sharing, which means that each process ef fec- 
tively has only a fraction of a processor's 
computing power. For instance, with k pro- 
cessors running n > k active processes, the 
computation of each process progresses at k/n 
times the rate it would progress if it had its 
own dedicated processor. 


In Section 3, the analysis assumes that 
there is no overhead associated with processor 
sharing; in Section 4, we relax this restric-— 
tion. It should be noted, however, that if the 
system is shared with other tasks, if our job 
is allocated a fixed percentage of the system 
resources, and if the system-wide scheduling 
policy is processor sharing, then the results of 


Section 3 hold even though they do not account 


for overhead. This is true because all running 
times are multiplied by the same constant, 


k-port Shared Memory 


(no contention) 


FIGURE 2 — k-processor model with shared memory 


namely, the reciprocal of the fraction allo- 
cated to our job, since processor sharing takes 
place even when our job has fewer active pro- 
cesses than processors. Section 4, then, is 
necessary only because a dedicated multi- 
processor system could refrain from processor 
sharing at that point, so the overhead would be 
paid only during a part of the computation. 


The most controversial (i.e., unrealistic) 
aspect of our model is the problem model. We 
assume the problem to be solved can be decom- 
posed into a number of subproblems, for para- 
llel solution, with the following properties: 


(L) The time required by the algorithm to 
solve a random instance of the problem on a 
single processor is a random variable X having 
the distribution F(x). We assume that F(x) = 0 
for x < 0 since processing times are non-nega- 
tive, and that yp = E(X) is finite. 


(2) The problem can be solved by solving 
all of any finite number n of subproblems, each 
of which is of the same type as the original 
problem, but is probabilistically smaller (in 
solution time) by a factor of n. Therefore, 
solution time for each subproblem on a single 
processor is a random variable having the 
distribution F(nx). 


the 


(3) The subproblem solution times are 
independent of each other and independent of the 
solution time of the original problem. 


Property (2) seems questionable at first 
glance, but is in fact quite reasonable, espe- 
cially for many numerical linear algebra pro- 
blems, sparse matrix manipulations, discrete 
optimization problems, and queries in large 
data bases, for instance. Property (3) is 
unreasonable in most cases, but this is the 
price we must pay for the ability to get ana- 
lytical results. Actually, the subproblem 
solution times may be almost independent if n 
is very large, or may truly be independent if 
randomness is induced by the algorithm itself 


[8]. 


The problem to be considered here is how to 
determine n such that the expected solution time 
on k processors is minimized. Each subproblem 
is allocated one process, and the total solution 
time is the elapsed time to completion of the 
last process that finishes, since all sub- 
problems must be solved in order to solve the 
original problem. Although in the model n can 
be arbitrarily large, presumably any real 
problem can be subdivided only so far before 
the assumptions fail. A conclusion such as 
"make n as large as possible" means "make n 
as large as possible such that the assump- 
tions (1) through (3) above are satisfied". 


3. "No Overhead" Analysis 


Let T,, (a) be the total solution time on k 


processors when the problem is divided into n 
subproblems; let Kan be the jth smallest of n 


independent random variables from the distribu- 
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tion F(x); and let uy. = E(X,._). Define Y,. 


jim, jin 
to be the solution time of the j subproblem if 
each subproblem (process) had its own processor. 


Then EY = u,,,/n because of property (2) in 


J 
the problem model. Finally, let 25 be the 


elapsed time to completion of the 3° sub- 
problem using processor sharing on k processors. 
Throughout this analysis we assume n > k, since 
it is clear that it is always worthwhile to have 
at least k subproblems. 


In order to solve the problem, we must find 
an expression for T,(n). By definition, T, (n) = 


s. Furthermore, note that S5 = (n/k)Y, (since 
there are n active processes before time Sy and 


each has k/n effective processors) and that 


Bis. SO + ( 

j j-1 
for 2 < j < n-k. This follows from the fact that 
between the completion of the j-1S¢ and qth sub- 
problems, there are n-j+l active processes. 
Solving the recurrence we find that 

£ nok 

Sank “ Yacksn te) 421 Veen : 

After time S_, there are at most k pro- 
cesses still active, so each has its own pro- 
cessor and there is no sharing. The time 
remaining until all finish is simply 


¥ so 
nin 


We sig eee 


? te 


T,, (n) = so = Y 
Taking expectations of both sides gives 


B(T,(n)) = up. ,/m + (/k) SEr uw ./n 


jin 


which can be rewritten as 


E(T, (n)) =u/k + 


n 


tg - (Ak) janie is /n. 


jin 


Note that u/k is the best value of E(T, (n)) 


we could hope for, and that the expression above 
exceeds this by only 
n 


(u )/ (kn), 


jen2ktl nin /jin 


which for most distributions F tends rapidly to 
zero as n increases. For example, for the expo- 
nential distribution 


E(T,(n)) = 


k 
u(1l/k + (,2,(1/i))/n); 


and for the uniform distribution 
E(T, (n)) = u(1/k+(k-1)/(2n(mt1))). 


It is clear that for a fixed number of pro- 
cessors k > 2, each of these is a decreasing 
function of n. What we need to show is that the 
same is true regardless of the underlying distri- 
bution F. In order to do this, we form 
AE(T, (n)) = E(T, (n)) - E(T, (n-1)), and then inves- 


tigate its behavior for various values of n> k, 
This will allow us to determine for what values 
of n the expected total solution time is in- 
creasing and for which it is decreasing. If 


. AE(T, (n)) < 0 then it is better to have n sub- 


problems than n-1; otherwise, it is better to 
have only n-l. | 


' Forming the expression for AE(T, (n)), then 
applying an identity from order statistics [4] 


(n-k) u,, 


en yen = Maen 


to get all variables in terms of expected values 
of order statistics from a sample of size n, 

and finally simplifying the sums, gives the sur- 
prisingly simple expression 


AE(T, (n)) = (u )/ (n(n-1)) 


n-kin Ho-Lin 


for n > k. Since < for i a 
Wg =P or any distri 


n-l:n 
bution, AE(T, (n)) < 0 for all n > k. In fact, 


unless k = 1 or the distribution F is degenerate 
and all problems have identical solution times, 
AE(T, (n)) < 0, which means that we should make 


nas large as possible to minimize the expected 
total solution time. 


4. Accounting for Overhead 


As users of real multiprocessor systems such 
as C.mmp well know, processor sharing is not 
implemented without overhead. Nevertheless, 
experiments on that system with parallel solution 
of integer programming problems led naturally to 
the model used and to the conclusion reached in 
Section 3. In this section, we explore the 
effects of overhead on tractability of the model 
and see why it is not a serious problem. 


Fortunately, it is possible for our model 
to account for the overhead in a simple manner. 
An approximation to processor sharing is achieved 
by allowing each process to run for a small 
length of time (a "quantum") using round- 
robin scheduling of the active processes. The 
overhead is associated with "context swapping" 
from one process to the next. If we define c 
to be the ratio of the context swapping time to 
the quantum size, the total overhead incurred 
is cS ik? and | 


Ty; (n) = s + cS ik 
for n > k. This follows from the fact that 
sharing is necessary only so long as there are 
more than k active processes (see Section 2). 


Proceeding in a fashion similar to that 
used above we find 


A(T, (n)) = (cet) Cu ) + 


n-k:n "n-k-1:n 
He chee. © Wl: / (a(n-1)) 
where Hoenn = 0 by definition. 
It is clear that we cannot make the same 
strong statement that we should always create 


as many subproblems as possible, regardless of 
c and F(x). Obviously, for large values of c 
we should avoid creating more processes than 
we have processors in order to avoid processor 
sharing. What is not so obvious is how to 
compute the optimum number of processes we 
should create for given c and F. 


It turns out that it is helpful to define 


a new quantity ere = ae eee gee 
which is the expected value of the difference 


between the ria and: 441°° largest (not smallest, 
as before) of n random variables from the 
distribution F. Writing AE(T, (n)) in terms of 


6 ». we have 
nin 
A(T, (n)) = (oGetl) 6) 55. = 
: | 
522 fae? patent)? 


where OG = Wy wn? This means that 


AE(T. (n)) <0 iff 
k i 


ce < (U/(et1)) 2, (55, 0/6 


jin een 


For certain forms of the distribution F 


there exist simple expressions for ne) 
allowing us to compute AE(T, (n)) explicitly for 


certain cases. For example, for the exponential 
distribution Dass = u/j and for the uniform 


distribution ee = 2u/(ntl). Therefore, 


AE(T, (n)) < O iff 


k 
e < ,£, (1/i) (exponential) 
1=2 
AE(T, (n)) < 0 iff 
c < (k-1)/ (k+1) (uniform), 


Since these conditions are independent of n, 

we may still reach the (rather strong) con- 
clusion that if c satisfies the appropriate 
condition above then we should create as many 
subproblems as possible. If ¢c is too large, 
then we should create exactly k subproblems. 
These conditions on c are extremely weak, since 
the value of c for a real system might be on the 
order of a few percent, while c < 1/3 suffices 
here even for only two processors. 


For most other distributions, no such 
closed-form expressions for ae are available. 


However, we can divide the possible values of n 
into three mutually exclusive and exhaustive 
ranges: 


Cl). 1.40 sks AE(T, (n)) < 0 always 


(2) a etl AE(T, (n)) < 0 4LfE 


Q 
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(3) n> kt: AE(T, (n)) < 0 iff 


in! iepiea? 

It is noteworthy that for n > ktl, the 
critical value of c (call it C) depends only on 
ratios of differences between expected values of 
order statistics. These ratios are, of course, 
distribution dependent, but are independent of 
location and scale parameters. Thus, if the 
distribution F is a gamma distribution, for 
instance, we can calculate the values of C for 
n > k+l without knowledge of the actual mean 
and variance for F. 


k 
ec < (1/kt1)) 5, (6 


For the sake of argument, let us assume that 
the distribution of problem solution times is 
adequately represented by a gamma distribution. 
Tabulating the critical values for this distri- 
bution, we find that c < 1/4 is a sufficient 
condition for AE(T, (n)) < 0 for all k > 2 and 


-n > k+l. Since c should be much smaller’ than 
that for a real system, we will assume that this 
condition is satisfied. 


Now the only question is whether 
AE(T, (k+1)) <0. If it is, then we should 


create as many subproblems as possible. If it 

is not, then we need to know whether the increase 
in solution time at n = k+l can be offset by the 
known decreases thereafter. The second problem 
is easy, since E(T, (n)) > u(1i+e)/k as n > ©, | 


Therefore, creating as many subproblems as 
possible is better than creating just k sub- 
problems iff c < Ve ey/U - 1. This apparently 


makes the answer to the former problem irrele- 
vant, for if AR(T, (k+1)) < 0 we would certainly 


have c < Wey! H ~ 1... Hence, we can conclude that 


if the ratio c of overhead to quantum size 

is at most 1/4, then the expected total solu- 
tion time is minimized by letting n = k when- 
ever c > Wey /t - 1, and by making n as large 


as possible if c < Hy /H - 1. 


The value of u really only determines the 
time unit and may therefore be set to 1 without 
loss of generality. In this case, MeeR = V Zeke? 


where Zz is the expected value of the largest 


k:k 
of k random variables from a gamma distribution 


with parameter L/v*. The coefficient of varia- 
tion of this distribution is V. Im order to 
determine in practice how many subproblems to 
create we could estimate c and V from experi- 
mental data, look up Zak in a table of expected 


values of order statistics [7], and decide on the 
basis of the criterion above. 
5. Conclusions 
Order statistics are a natural for analysis 
of many parallel algorithms, since total running 
times depend on those of the minimum or maximum 


running times of subproblem solutions. It is 
only reasonable that computer science should make 
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good use of the large body of knowledge statis- 
ticians have already compiled regarding their 
behavior. 


Other scheduling models lead to interesting 
problems. Coffman and Denning [3] note that, 
in general, processor sharing may be better or 
worse than.a simple list schedule (in which the 
n processes queue up to the k processors, and 
when one finishes another begins). While it is 
possible to construct examples where each is 
superior, which is better on the average? 
Analysis of list schedules seems to require 
use of renewal theory rather than classical 
queueing theory. In any event, we feel we 
have made a good case for further exploration 
of the application of statistical methods to 
performance models of parallel algorithms. 
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PARALLEL ALGORITHMS FOR THE MINIMUM 
~ SPANNING TREE PROBLEM 


Narsingh Deo and Year Back Yoo 
Computer Science Department 
Washington State University 

Pullman, Washington 99164 


Summary 


Sequential algorithms for minimum spanning 
tree (MST) fall into three categories--all using 
greedy strategies. They are: (1) Prim-Dijkstra 
nearest-neighbor method, (2) Kruskal's lightest- 
edge-first method, and (3) Sollin's lightest-edge- 
from-each-vertex method. In this paper we study 
the parallelizability of these algorithms. 


Recently some research effort has been 
reported on parallel algorithms for solving the 
MST problem [1], [4], [5]. Savage [4] proposed a 
parallel MST algorithm based on Sollin's, which 


runs in 0(109° n) time on a parallel machine with 


O(n"/log n) processors (where n is the number of 

vertices in the graph). Bentley [1] has given a 

parallel version of Prim-Dijkstra algorithm which 
runs in O(n log n) on a special-purpose parallel 

computer, called the tree machine. 


In this paper, we design three parallel MST 
algorithms under the assumption that (i) the 
number of available processors is no more than n 
and that (ii) available machine is an MIMD-type 
general-purpose parallel computer. 


PRIM-DIJKSTRA ALGORITHM: With weight matrix as 
the data structure used, sequential Bos Dijkstra 


algorithm has a time complexity of o(né ). We 
parallelize this algorithm with p processes as 
follows: P processes are created. Each process 
takes n/p vertices and finds its nearest vertex 
in parallel. The processes are synchronized, and 
the nearest vertex is found. Then another set of 
p processes are created for updating. The pro- 
cesses are synchronized again and one iteration is 
complete. This parallel algorithm requires the 
Same number of total iterations as the sequential 
one, but the work is divided among p processes. 
If “i snotee p = vn , the time complexity becomes 


O(n *), This performance compares well with 
Bentley's O(n log n) time on n/log n - processor 
tree machine [1]. Both have processor-time pro- 


duct of 0(n°). This also compares well with 


Savage's processor-time product of O(n? log n) 
[4]. The parallel Prim-Dijkstra algorithm may be 
described in an Algol-like language as follows: 


Process MAIN 


for 1:= 1 to n do 

begin NEARTT: = 1; DIST[i]:= W[1,i] end; 
NEAR[1]:= 03; (* Start with vertex 1 * 
while |T| < n-1 do (* edges in MST is n-1 *) 
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begin | 
Vmin:= be syn:= 0; (* syn : semaphore *) 
for i:= 1 to p do create TASK1(i); 
while syn < p do wait; | 
T:= Tu (ji; NEAR[JJj])3; NEAR[Jj]:=0 03 syn:=05 
for i:= 1 to p do create TASK2(i); 
while syn < p do 1 wait; 

end; 


Process TASK1(j) 


Tit] 1: | 
for i:= j to n by p do 
if NEAR[i] > 0 and DIST[i] < DIST [ii] 
then ii:= 15 
lock Vmin; 


At: DIST[i7] < Vmin then 

begin jj:= ii; Vmin:= DIST[ii] end; 
unlock Vmin; 
lock syn; syn:= syn + 15 unlock syn; 


Process TASK2(j) 
for T= J ton by p do 
begin k:= NEAR[i]; 
if k > 0 and W{i, k] > Wi,JJ] then 
begin NEAR[i]:= jj; DIST[il:= WLi.jj] end 
end; 
Tock Syn; syn:= syn + 13 unlock syn; 


KRUSKAL'S ALGORITHM: With a heap as the data 


‘Structure used, the time complexity of Kruskal 's 


algorithm is O(m log m), where m is the number of 
edges. One way to parallelize Kruskal's algorithm 
is to use two processes, Producer and Consumer, 
which run asynchronously. A circular queue, Q, 

is used as a message buffer. The two processes 
operate as follows: 


Producer maintains a min-heap and sends the 
top item, which is the next lightest edge to be 
considered, to Q. If Q is full, Producer waits 
until Consumer takes out an item from Q. As long 
as Q is not full, Producer continuously produces 
the next lightest edges and sends them one by one 
to the rear of Q. 


Consumer takes out items continuously one by 
one from the front of Q as long as Q is not empty. 
Then it examines whether or not the current edge 
creates a cycle. If not, Consumer adds the edge 
to MST, combining the two subtrees. 


Heap adjusting step cannot be done in paral- 
lel because of the inherent precedence constraint. 
Therefore, the complexity still remains 
O(m log m) . Furthermore, the degree of parallel- 
ism is at most two. These limitations make 
parallel version of Kruskal's algorithm less 
attractive. This version is given below: 


Process MAIN 
T:= 0; syn:= 03; num:= 03 
make initial heap; 
create PRODUCER; 
call CONSUMER; 


Process PRODUCER 
last:= m; rear:= 0; 
while CONSUMER is live do 
egi 
if Q is full then wait; 
rear:= (reart+t1) mod b; (* b = |buffer| *) 
send top item of the heap to the rear of Q; 
lock num; num:= num + 1; unlock nun; 
move last item of the heap to the top; 
last:= last - 1; 
call HEAP(1,last) 
end; 


as 


(* Adjust heap *) 


Process CONSUMER 


while |T] < n-1 do 


begin 
if Q is empty then wait; 


front:= (frontt+1) mod b; 

u,V.W := Q[front]; 

lock num; num:= num - 1; unlock num; 

rl:= FIND(u); rl:= FIND(v) 

if rl <> r2 then (u,v) is in MST *) 

begin T:= T u (u,v); call UNION(ri1,r2); end 
end; 


SOLLIN'S ALGORITHM: Sollin's algorithm [5]can be 
parallelized as follows: The lightest edges inci- 
dent on each vertex are selected simultaneously. 

If the resulting forest does not form a spanning 

tree, the same procedure is applied to the forests 
until only one tree is formed. In the worst case 
this algorithm will require log n pees with 


n processes. Each iteration requires O(n 210) 
units of time if p processors are ir degiag 


Therefore time complexity becomes a(n® /p log n). 


Processor-time product is O(n” log n) which is the 
same as that of Savage's. A large transportation 
network is often in the form of a grid in which 
the degree of each vertex is four or less.: Such 

a graph if stored in the forward star form would 
require at most three comparisons per vertex to 
find the lightest edge incident on it (rather than 
n-1). In that case the time complexity of this 
algorithm will be 0(3n/p log n). A detailed 
description of the parallel Sollin's algorithm is 
given below: 


Process MAIN 


T:= @5 
while |T| < n-1 do 
begin 
syn:= 05 Vmin:= «3 
for i:= 1 to p do create TASK(i); 
while syn < p do wait; | 
for i:= 1 to n do 
begin 
if Vmin[i] <> » do 
begin 
rl:= FIND(Vi[i]); r2:= 
if rl <> r2 then 


begin 


FIND(VJ[i])s 
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T:=T u (Vifil,Vjli]); call UNION(r1,r2) 
end 
end 


Process TASK (ii) 
for 1:= ii to n by p do 


for j:= 1 to | n do 
= FIND(j); 


Vi[rl]:=i13; Vj[rl]:=j; Vmin[rl]:= W[i.j] 
d 


Tock syn; syn:= syn + 13; unlock syn; 

REMARKS and CONCLUSION: A parallel version of 
Cheriton and Tarjan's O(m log log n) algorithm 
turns out to be Sollin's. The reason is that 
Cheriton and Tarjan's algorithm was derived from 
Sollin's [2], [5]. 


Sollin's algorithm is easily parallelized; 
Kruskal's is not. Prim-Dijkstra algorithm falls 
in between the two. 


These parallel algorithms have been coded in 
HEP Fortran and an empirical study is underway 
for comparing their average case performances. 
Details are given in [3]. 
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Abstract ~~ Image correlation is representative 
of a wide variety of window-based image processing 
tasks. 
tems (e.g., PASM) can use SIMD parallelism to per- 
form image correlation is examined. Two fundamen- 
tal algorithm strategies are explored. The "time 
/ space / interprocessor-transfer" complexities of 
the two algorithm approaches are analyzed in order 
to quantify the differences resulting from the two 
strategies. For both approaches, the asymptotic 
time complexity of the N-processor SIMD algorithms 
is (1/N)-th that of the corresponding serial algo- 
rithms. 

1. INTRODUCTION 


Image correlation is a widely used procedure in 
many areas of image and picture processing. This 
process, also known as template matching, is used 
in some forms of edge detection (13], or in image 
registration, to match pieces of two pictures to 
one another (12]. in digital photogrammetry, im- 
age correlation is used to find the corresponding 
points of two images of a stereomodel. In this 
application, image sizes are typically at least 
4096 by 4096 with match areas on the order of 64 
by 64. 

Because image correlation requires comparing 
portions of two images in a Large number of rela- 
tive positions, it is an extremely time consuming 
process. The time required to complete these cal- 
culations can be reduced by exploiting the paral- 
Lelism inherent in tne task.. The way in which 
multimicroprocessor systems (e.g., PASM (16]) can 
use "SIMD' parallelism to perform this task is ex- 
amined here. 

The SIMD (single instruction stream - multiple 
data stream) [8,20] machine model used here con- 
sists of a control unit, interconnection network, 
and N PEs (processing elements), where each PE is 
a processor-~memory pair C153. In an SIMD machine 


of size N= 2", the PEs are addressed (numbered) 
from 0 to N-1. In proposed systems, N is as Large 
as 1024 [16] to 16,384 [11]. The control unit 
broadcasts an instruction to all PEs, and all ac- 
tive (Cenaodled) processors simultaneously execute 
the instruction, each on data in its own memory. 
The interconnection network provides inter-PE com- 
munication. SIMD parallelism has been shown to 


This work was supported by the Air Force Office of 
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The way in which multimicroprocessor sys-~ 
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reductions in computation time 

for image and speech processing tasks 

Ce.g., 9,14,18]. Here, window~based ~ image pro- 

cessing tasks are considered. 

In the complexity analyses that follow, it is 
assumed that each required parallel inter-PE data 
move can be done in one transfer step. This will 
be true if the interconnection network used is a 
multistage network such as: (a) one employing the 
generalized cube topology with individual box con- 
trol C17] Ce.g., omega ([C10J, n-cube (£113); (Cb) 
the data manipulator network [6]; or (c) the 
augmented data manipulator C17]. This is because 
each required transfer is either a type of ex- 
change (cube connection [15]) or a "uniform shift" 
Cis@e> from PE ji to PE itk mod _ =~wWN, 
O<i<N, k fixed). 

Only those SIMD machine features needed for the 
algorithms that follow nave been described. The 
model is intended to provide a general framework 
in which SIMD algorithms can be developed. In 
section 6, the performance of the algorithms using 
an alternative model will be discussed. 

The opjectives of this study are as follows: 

1. To demonstrate the applicability of the SIMD 
mode of parallelism to a class of image pro- 
cessing tasks. The operations performed in 
image correlation are representative of the 
types of data manipulations needed for a wide 
variety of window-based image processing 
tasks. 

2. To explore two fundamental parallel algorithm 
strategies. In one approach, all of the data 
that will be needed by a PE is transferred to 
the PE and processed there. In the other, 
each PE performs all possible operations on 
its local data, generating partial results 
which are then transferred to the PE in which 
they are needed. | 

3. To analyze and compare the computational re- 
quirements of the alternative algorithms. in 

serial algorithms, there is often a_ tradeoff 

between computation time and space. In paral- 
lel. algorithms, the tradeoff may be a function 
of three parameters: computation time, space, 
and inter-PE communications. 

In the next section, image correlation is de- 


yield significant 


fined. In the subsequent sections, parallel algo-. 
rithms for image correlation are presented and 
analyzed. 


2. IMAGE CORRELATION 


A. Definition and Serial Algorithms 

An image is. represented by a two-dimensional 
array where each element ("pixel") has an unsigned 
integer value representing the "gray Level” of the 
pixel. Image correlation involves determining the 


position at which a relatively small match area 
best matches a portion of an input image. Corre- 
Lation measures are used to measure the degree of 
similarity or disagreement between the match area 
and an equivalent size area on the input image. 
Let the symbols x and y denote single elements of 
arrays X and Y, where X is the match image and Y 
is an area of the input image which has the same 
dimensions as X. Let M be the number of elements 
in the match area X. Two representative correla- 
tion measures are: 


SXY = Exy - =x Ly/M 


RXY = sxv/(sxxasyy) 61/2 


Correlation measure SXY is the covariance of 
the match area with a portion of the input area. 
Large positive values indicate similarity, while 
Large negative values indicate similarity between 
a positive and a negative image. Values near zero 
indicate Little or no similarity. Correlation 
measure RXY is the Linear correlation coefficient 
of statistics. This measure is a normalized ver- 
sion of SXY, with values ranging between +1 = and 
-1. A value of +1 indicates exact similarity 
while values near zero indicate Little similarity. 
In general, a correlation value will be computed 
for every possible position where the match area 
will fit on the input image. The match position 
where the correlation measure is maximized 
corresponds to the best placement of the match 
area on the image. 

The computation time for image correlation is 
dominated by the time to compute the rxy, Ly, and 


(for measure RXY) the zy* values for all possible 


match positions. The <x and Ex? values involve 
only the match area elements, and need to be com- 
puted (or precomputed) only once. 

The way in which data elements are combined to 
obtain the srxy values is similar to operations 
performed in a variety of important image process~ 
ing tasks, including convolution and filtering. 
For an input image having R rows and C columns and 
a match area having r rows and c columns, there 
are (R-r+1)(C-ct1) match positions. Serial compu- 
tation of the i<xy terms over the entire image, 
performed by simply sliding the match area _ over 
the image and calculating the value of ixy for 
each overlap position, requires (R-rt+1)(C-ct+1)re 
multiplications and (R-r+1)(C-ct1)(re-1) addi- 
tions. 

In computing the sxy values, each match posi- 
tion generates a new set of terms to be summed. 
No terms from one match position can be reused in 
a different match position. In computing the ry 


and sy? values, two (or more) input image elements 
summed for one match position may also be summed 
for another match position. The algorithms con- 


sidered for calculating the yy = and ny* values 
therefore attempt to avoid "redundant" operations, 
e.g., performing a sum for one match position 
which has already been performed for another. The 


operations performed in computing the sy and zy* 
values, i.e., the summing of elements under a win- 
dow where the window moves over an image, are typ~ 
ical of operations required for a variety of image 


processing tasks. These include image smoothing, 
edge enhancement, and convolution using aé=erec- 
tangular window. 

Consider the following serial (uniprocessor) 
algorithm for computing the ty's, i.e., summing 
the pixel values in each match area. This algo- 
rithm will be used as a basis for parallel algo- 
rithms. 

Assume that for input image I, the position of 
the match area is defined by the coordinates of 
the input image pixel covered by the upper left 
corner of the match area. Let "colsum'" be a vec- 
tor of length C, where 


kt+r-1 
colsum(j) = >> 


i= 


I(i,j) 


where k is the row coordinate of the current posi- 
tion of the match area, and 0 < j < C. Let SUM be 
an R-rt+1 by C-ct+1 array, where SUM(Ci,j) is the sum 
of pixels of I for the match area position (i,j), 
QO <i < R-rt1, U <j < Coctt. 

The algorithm is shown in Fig. 1. First, col- 
sum is initialized for row 0 of the image. The 
colsum values for columns 0 to c-1 are summed to 
compute SUM(U0,0). SUM(O,j}) for 1 < j < C-ct1 is 
computed from SUM(0,j-1) by subtracting 
colsum(€j-1) and adding colsum(jt+c-1). <A similar 
strategy is used to compute SUM(1,j) for 
1 < i < C-cti and 1 < j < R-r+1. To do this, each 
colsum(j) is first updated by subtracting I(¢i-1,j) 
and adding ICitr-1,j), 0 < j < R-rt1. 

The complexity of this serial algorithm, in 
terms of additions, is 


4RC - Reo -— 3Cr + reo + 5C + BR - 2c - Sr t 4. 


(For simplicity, the additions required for Loop 
counting and indexing have not been included. In 
the SIMD algorithms, these would be performed in 


/* initialize values of colsum */ 
for j = 0 to C-1 do 
colsum(j) = 1(0,}) 
for i = 1 to r-1 do 
colsum(j) = colsum(j) + 1(i,j) 
/*x compute SUM(9,j) for 0 < j < C-(c-1) */ 
SUM(0,0) = colsum(0) _ 
for j = 1 to c-1 do 
SUM(0,0) = SUM(0,0) + colsum(j) 
fOC j= t toC=€ce,) edo. 
SUM(O,j) = SUM(O,j-1) - colsum(j-1) + colsum(jtc-1) 
/* compute SUMCi,j) for 1 < i < R=-(r-1) 
and 0 < j < C-(c-1)) * 7 
for i = 1 to R-(r-1) do 
/* compute SUM(i,0) and update associated 
colsum values */ 
for j = U to c-1 do 
colsum(j) = colsum(j) - 1¢€i-1,j) + IGi+r-1,j) 
SUM(i,0) = colsum(0) 
for j = 1 to c~1 do 
SUM(7,0) = SUM(i,U) + colsum(j) 
/x compute SuM(7,}) and update associated colsum 
values for 1 < j < C=(c-1)) */ 
for j = 1 to C-(c-1) do 
colsum(jtc-1) = colsum(jtc-1) - 1(i-1,j+c-1) 
+1044 7r=1,j) +671) 
SUM(Ci,j) = SUM(1,j3-1) - colsum(j-1) 
+ colsum(jt+c-1) 


Fig. 1: Serial algorithm to compute ry terms. 


the control unit, and could be overlapped with the 
PE operations.) This algorithm moves the match 
area along the rows of the input image. Depending 
on C, R, c, and r the algorithm complexity may be 
less by moving along columns. 


Computation of the sy°'g is similar. In this 


case, the y° values subtracted from the colsum's 
in the update process (see Fig. 1) must be saved 
when they are first calculated. This increases 
the space required for the algorithm by rC. The 
arithmetic complexity is increased by RC multipli- 
cations. 


If the xrxy, ty and sy° values for a given match 
position are computed together, the correlation 
measure for that match position can be calculated, 
and is saved only if it is the current maximum 
over the correlation measure values computed so 


far. Thus, the ixy, Ly, and ty* values for each 
position do not have to be saved. 


B. Parallel Image Correlation 


In section 3, a parallel algorithm for comput- 
In sections 4 


and 5, parallel algorithms for the ixy, fy and ry? 
computations are presented. For the ixy, fy and 


ing the rx and ax? values is given. 


xy° operations, two algorithm strategies are ex- 
plored. For both, the input image data will be 
divided among the PEs, and each PE will compute 
the vatues of the correlation measure for a por- 


tion of the input image. In the first, "complete 
sums" approach, all of the data which will be 


needed for the computations performed in a given 


PE is transferred into that PE. ALL subsequent 
operations can then be performed Locally, so that 
each PE computes the “complete sums" for a set of 
match positions. In the second, "partial sums" 
approach, each PE performs as much of the computa- 
tions as possible using its own data, then 
transfers partial results to the PE in which they 
are needed. 

In order to distribute the input image, the N 
PEs of the system are Logically configured as an 
NR by NC rectangular grid, on which the R by C im- 
age is superimposed. Thus, with the possible ex- 
ception of the rightmost column and bottommost row 
of PEs, each PE holds an R' by C' subimage, where 
R*' = TR/NR1T and C' = FC/NC1. This is shown in 
Fig. 2. The values for NR and NC will be chosen 
to minimize execution time of the algorithms, and 
will be discussed in section 4.A. 

(An alternative to these approaches is to as- 
sume that the SIMD machine has the capability to 
Load the image data into several PEs simultaneous- 
ly. With this capacity, an element of the input 
array which is needed in several PEs could be 
Loaded into the appropriate PEs (with Little or no 
cost) simultaneously. This would eliminate the 
need for inter-PE transfers. However, the memory 
management necessary to place each image point in 
the appropriate Location in each PE may be signi- 
ficantly more complex than the memory management 
needed to load the PE memories with disjoint sub- 
images. This approach will require an “intelli- 
gent'' memory management system and more storage in 
each PE, and will not be considered here.) 
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PE 1 


PE NC +1 


OR 
pixels 
rR! 
pixels 
C pixels 
Fig. 2. Data assignment of R by C image to N PEs. 


In the algorithms to compute the ixy, Zy and 
sy° values, it will initially be assumed that the 


results calculated (i.e., the <=xy, fy and ny’ 
values) are saved. For the calculation of RXY and 
SXY, this will not be necessary, as will be 
described in subsections 4.C and 5.C. However, so 


that each of the rxy, ty, and ny? algorithms can 
be applied to other related computations, in the 
presentations it will be assumed that the results 
for the whole image are to be stored. 
3. <x AND 5x’ COMPUTATION 

The xx and ax” values may be precomputed and 
stored with the match area, or computed in a 
straightforward manner in parallel before calcu- 


Simply assign 
Each PE 


first computes x? for all the elements it holds. 


Lating the xixy, Ly, and ny? values. 
to each PE M/N of the match area pixels. 


It then sums its x values and sums its xe values. 


ALL of these local 2x and Local Ex® sums are then 
combined using a recursive doubling aporoach (19]. 
Each even numbered PE J sends its Local =x result 
to PE J+1. Simultaneously, each odd numbered PE 


J+1 sends its local rx? to PE J. The odd numbered 
PEs add the received data to their local x and 
then compute the whole <x using recursive dou- 
bling, with the result saved in each odd numbered 


PE. Similarly, the even numbered PEs compute mx’, 
These two recursive doublings can occur simultane- 
ously. The odd and even PEs then exchange 


results, so that each PE contains both rx and sx*, 
This requires M/N multiplications, nt+(2M/N)-2 ad- 
ditions, and n+1 inter-PE data transfers. (A 
serial algorithm will require M multiplications 


and 2-2 additions.) Each Pe will store: sx and :x° 


for later use. 
4. COMPLETE SUMS APPROACH 


A. zxy Computation — 


In the complete sums approach, each PE will 
compute the correlation measure for overlap posi- 


C! 

eS aE 
PE J 

PE J+] 
Cc 
R! 
he 
‘ 
PE J+NC PE J+1+NC 
Fig. 3. Example of overlap position requiring 


data transfers. The shaded pixel represents the 
“beginning” of the overlap position. The arrows 
indicate the directions of the data transfers. 
(Proportions of match area to a PE's subimage are 
not necessarily to scale.) 


tions which "begin" in the PE (i.e., tor which the 
upper left corner of the match area overlaps a 
point of the PE's subimage). Each PE will there- 
fore compute the correlation measure’ for R'C' 
match positions. The computations will be per- 
formed simultaneously in all PEs. For match posi- 
tions where the portion of the input image is not 
fully contained in a single PE (Fig. 3), the need- 
ed points will be transferred before the computa- 
tions are performed. Such transfers will occur 
simultaneously for all PEs, so that at the same 
time that a pixel is being transferred, for exam- 
ple, from PE J+1 to PE J, the corresponding pixel 
is being transferred from PE J+2 to PE J+1, from 
PE J+3 to PE J+2, and so on. 

Depending on the size relationships between r 
and R‘' orc and C', the transferred elements may 
come from PEs adjacent to PE J, or from. several 
levels of adjacent PEs. If, for example, the 
match area dimension in one direction is Large in 
comparison to the dimension of the portion of the 
input area stored in each PE, the matches will ex- 
tend over several PE areas in that direction. PE 
J will transfer some y values a distance greater 
than one, and will receive some y values from a 
distance greater than one. Without loss of gen- 
erality, in the subsequent discussions, it will be 
assumed that elements are needed only from adja- 
cent PEs. 

When computing the rxy values, all PEs will use 
the same match area element simultaneously, so 
that element can be broadcast to all PEs from the 
control unit. Alternatively, if PE memory space 
is available, the match area, which is typically 
small, can be held in each PE's memory. The time 
to perform the broadcast from the control unit 
versus the memory fetch from the PE memory will be 
implementation dependent. In the space analyses 
that follow, it will be assumed that the match 
area values are broadcast from the control unit. 


Complete Sums Partial Sums 


# mult steps R'C're . R'C're 

# add steps R'C'(re-1) R'C'(rce-1) 

# transfer R'Com1 40" (r=1) R*Co=1)4C"(r=1) 
steps +(r-1) Cc-1) +(r-1) (c-1) 

space 2R °C '+(2c-2) (r-1) +1 2R'c! 


Table 1: Complexity of complete sums and partial sums 
algorithms for computing sxy terms. 


Computation of all of the xsxy terms will be ac- 
complished in the time required to compute the rxy 


terms for the R' by C' subimage held in a_ single 
PE. These times are Summarized in_ the first 
column of Table 1. Storage will be required for 


of the input image (R'C' ele- 
ments), for the computed xxy values (R'C' ele- 
ments), and for the input image elements 
transferred to the PE in order to provide all of 
the data needed for the PE's match positions. The 
number of transferred elements is 
Ccm-TI)R' + Cr-1)C" - Cr-1)€c-1)3; however, it is not 
necessary to store all of these values at the same 
time. Consider the extra storage needed for non- 
Local y values by a typical ("non-edge") PE J. 
The analysis is divided into two cases. It will 
show that at any point in time at most 
(2c-2) (r-1)+1 locations are required. 

First, consider when the match area (upper Left 
corner) is positioned in row i, 0 < i < R'-r (see 
Fig. 4). (c-1)r locations are required for non- 
Local y data, for the y data for columns 0 to c-2 
of rows i to rti-1 of PE J+1's subimage. The xy 
values can be calculated by moving the match area 
from position (0,0) to (0,1) to ... (0,C'-1), then 
from (1,0) to (1,1) to ... (1,€'-1), and finally 
from (R'-r,0) to (R'-r,1), to «2. (CR'or,C'-1). 

Next consider when the match area is po- 
sitioned in row i, R'-r < i< R' (see Fig. 4). In 
this case, at most (2c-2)(r-1)+1 locations are re- 
quired for non-local y data. For these match po- 


the PE's. portion 


C! 


Fig. 4. Indexing in a PE's subimage. 
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sitions the match area will move along columns in- 
stead of rows, from (R'-rt+1,C'-1) to (R'-rt2,C'-1) 
to eee (R'“1,C'-1), then from (CR'-rt1,C'-2) to 
(R'-rt2,C'-2) to «2. (R'=1,C'-2), ... finally 
from (R'-rt+1,0) to (R'-r+2,0) to ... (R'-1,0). 
For match positions (i,j), where R'-r <i<R', 
and j is fixed at a value in the range 
QO < j < C'=c, the non-local y data needed are rows 
O to r-(R'-i)-1 of columns j to jtc-1 of PE J+NC's 
subimage. For match positions (i,j), where 
R'=-r < i<R', and j is fixed at a value in the 
range C'-c < j < C', the non-local y data needed 


are rows i to R'=-1 of columns 0 to c-(C'=-j)-1 of 


PE J+1's subimage, rows 0 to r-(R'-i)-1 of columns 
j to C'-1 of PE J+NC's subimage, and rows QO to 


r-(R'-i)-1 of columns QO to c-(C'-j)-1 of PE 
J+NC+1's subimage. The maximum non-local — y 
storage needed for this range of i andj is 


(2c7-2) (r-1) +1. 

For given c, r, C, R, and N, the number of ar- 
ithmetic operations required for the algorithm is 
minimized by minimizing Ip =R'C'. By choosing 
Ip = RC/N, 1.e@., by dividing the input equally 
among the PEs, this minimum is_ attained. The 
number of transfer steps will be minimized by the 
values of C' and R' for which the expression 


Cr-1)C* + Cc-1)R' 


Minimizing with respect to R' gives 
(1/2) 


is minimized. 
R' = ((r-1)*Ip/(c-1)) 


subject to the constraints that R' and Ip/R' be 


integers. It will follow that 
C8 = (Con1)*Ip/(r-1)) 61/2), 


In the special case where c = r, the image should 


be distributed such that 


meagre’, 


that is, each PE should contain a square subimage. 


B. oxy and sy* Computation 
The complete sums algorithms to compute Ly and 


y* values will be based on the serial Zy and sy? 
algorithms, with each PE operating on-— an 
CR'tr-1)(C'+c-1) subimage. 


Consider computing sy and sy? in a_ typical 
("non-edge") PE bs A total of 
Cr-1)C'+Cc-1)R'trco-1 y values must be transferred 


into the PE from adjacent PEs, as discussed in the 
previous subsection. The transfers are as_ shown 
in Fig. 3. However, it is not necessary to store 
all of these if data is transferred only when it 
is first needed. This is explained below in two 
cases. It will be shown that at most (c~1)r Loca- 
tions will be required at any point in time. 

When the match area is positioned in row i of 
the PE's subimage, 0 < i1< R'-r, (c-1)r storage 
Locations are required for nonlocal y data, for 
the y data for columns 0 to cm2 of rows i to rti-' 
of PE J+1's subimage. 

When the match area 
of the PE's subimage, 


is positioned in row 1 
R'=r < 1 < R', at most 


194 


(c-1) (r-2) Locations are required for non-local y 
data. Most y data can be incorporated into the 
current ty being computed and the appropriate 


"“colsum" vector Location when it is transferred 
into a PE. The only y data that needs to be saved 
is that which will be needed for Later “colsum" 
updates. Specifically, this is rows R'=rt+1_ to 


R'-2 of columns 0 to c-e of PE J+1. 
Using these data storage strategies, the ry and 


y* values for each match position can be calcu- 
Lated as described in the serial algorithms (for a 
(C'+c-1)(R'+r-1) image). The complexities for the 


ry and sy* computations are given in column one of 
Tables 2 and 3 respectively. 


Complete Sums Partial Sums 


# add steps 4RC'+3R'ctC ir 4RIC'HR'ct#3C ir 
“R'+C '+rc ~SR'=5C tre 
-r -3r-ct3 
# transfer R'Ccr1)+C'Cr-1) R'Ccr1)4+C' Cr-1) 
steps +¢€r-1) (c-1) +(r-1) (c-1) 
space 2R'C'+(C'+ce-1) 


eR'c'+c! 
+(€c-1)r 


Table 2: Complexity of complete sums and partial sums 
algorithms for computing sy terms. 


Complete Sums 
CRP +r—-1)(C'+c-1) R'ct 


Partial Sums 
# mult steps 
# add steps 


4RIC'+3R'ctC'r QRIC'+HR'ct#3C'r 


mR S40 tre “3BR'-5C'tre 
-r -3r-ct3 
# transfer R*Ccr1)4C'Cr-1) | R'(cn-1)4+0' Cr-1) 
steps +(r-1) Cc-1) +(r-1) (cr) 
“space 2R'C'+(C*4+0-1) Cr+1) 2R'C*+C' (r+1) 


Table 3: Complexity of complete sums and partial sums 
algorithms for computing sy? terms. 


C. RXY and SXY Computation 


To compute RXY (Cor SXY) the previously 
described operations are interleaved so that the 


IXY, LY, sy’, and RXY (SXY) values for one match 
position are computed before the match area is 
moved to a new position. The maximum RXY (SXY) 
value and its match position coordinates are 
saved. The computation of RXY is described; the 
SXY computation is a subset of those operations. 
Consider the computation performed in a typical 
("non~edge") PE J. In order to combine the algo- 
rithms of subsections 4.A and 4.B, the Ixy algo-~ 
rithm must be slightly modified. The match area 
will move over the image in the way that was 


described in the xy algorithm, that is, from posi- 
(0,C'-1), then. from 
(1,1) to oe. (1,C'-1), «2. finally from 
(R'-1,C'-1). The 
is for 0 <i < R'-r, when 


tion (0,0) to (0,1) to ... 
(1,0) to 

(R'-1,0) to 
worst case for space 


CR'-1,1) to «ee 


r(C'+tc-1) space is needed for yo values and r(c-1) 


for y values (plus “coisums" and the original im- 
Less space is needed when R'-r < i<R_ be- 


age). 


cause space is not needed for non-local y* values. 
Column one of Table 4 summarizes’ the 
time, transfers, and 


summation: of that for computing rExy, ry, and sy* 
The transfers are for 


for every match position. 


total 
space used. The time is a 


Complete Sums Partial Sums 


# mult steps those for sxy and ty* those for Exy and ry? 


# add steps those for rIxy, ty, and ty? those for txy, Ly, and ty? 
# transfer R'Co-1)4E'Cr-1) SCR" Cc-1)4+C' Cr-1) 
steps +(r-1) Cc-1) +(r-1) Ccm1)] 
space R'C'+(C8te-1) Cr+2) +r(c-1) R°C'+r(3C '+2c~2) 
Table 4: Complexity of complete sums and partial sums algorithms for | 


computing RXY. In addition to the above, each approach uses 2 
subtractions, 3 multiplications, 2 divisions, and 1 square 
root operation for each of the R'C' match positions in order 
to combine terms. Both methods also require O(n) additional 
transfers for determining the maximum RXY value (and its coor- 
dinates in the input image) over ali PEs. 


the non-local y data needed. The space is for the 
PE's own  subimage, the non-local y_ storage 
described above, and the extra storage used for 


intermediate results in calculating fy and zy’, 
Once each PE has found itsS own maximum  RXY 
value, recursive doubling [19] can be used to find 


the overall maximum and its Location. This will 
require O(n) additional inter-PE transfers. 
3. PARTIAL SUMS APPROACH 
A. xxy Computation 
The partial sums procedure for computing the 
ry values consists of three steps. The first 


step is the generation of partial sums by perform- 
ing all parts of the calculation that can be done 
using the data within each PE. In the second 


step, the results of the partial sums generation 
are transferred so that each PE contains all of 
the partial sums needed to form the xxy terms. In 


the last step, the final sums are developed within 
each PE by combining the appropriate partial sums. 
The details for this procedure follow. It is as- 
sumed that the match area elements are either 
broadcast from the control unit or stored in’ each 
PE's memory, as was discussed in subsection 4.A. 
In the first step of the algorithm, each PE, 
independently of the others, computes the "partial 
sums" of match point-~image point products that can 
be computed with its own data. This can be visu- 
alized by sliding the match area M over the image 
area in each PE, as_ shown in Fig. 5. At each 
match area-image area position from Fig. 5, a 
“oartial sum" is generated. For each Location 
where an image point and match point overlap in a 
given position, the product of the image and the 
match points is calculated; all the products’ for 
that match area-image area position are then 
summed. The partial sum terms generated by this 
procedure can be viewed as forming a (R'tr-1) by 
(C'+c-1) array called "psum."' In Fig. 5, the ele- 
ment of "“psum" into which the partial sum is 
stored is given for each of the example overlap 
positions. A match position will again be num- 
bered by the input subimage coordinates (i,j) of 
the upper Left corner of the match area. Since 
the match area may not be contained in the input 
image area however, the ranges of i and j differ 
from those in the serial and complete sums algo- 
rithms. If the upper Left corner of the input su- 
bimage is considered to be position (0,0), 
“rp <71<R' and -c <j < C'. The number of par- 
tial sums that must be computed in each PE is 
(r+R'-1)CctC'-1). To develop these terms, every 
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psum(-r+1,C'-1) 


——» psum(i,j) 


psum(R'-1 ,-c+1) psum(R'-1,C'-1) 


Fig. 5. Overlap positions in the partial sums ap- 
proach, and the terms of the partial sums (psum) 
array calculated. 


element of the match area M will be multiplied by 
every element of the input area in the PE. There- 
fore the number of multiplications required is 
rcR'C'. The number of additions required is equal 
to the number of multiplications minus the number 
of terms generated, or rcR'C' — (rt+R'-1)(ctC'-1). 
Once the partial sums have been computed in- 
dependently in the PEs, it is necessary to combine 
the results from other PEs to build the complete 
SUMS. Using the criterion that the upper left 
corner element of the match area must be present 
in a partial sum for it to remain in a PE, the 
terms in the rightmost C' columns and _  bottommost 
R' ‘rows of the “psum" array are kept in the PE, 
and are Labeled "KEEP" in Fig. 6. Those elements 
"above" the kept area are transferred to PEs 
"above" this PE. Similarly, those elements "to 
the left" of the kept area are transferred to PEs 
“to the left,'"’ and the terms on the upper Left are 
transferred to PEs diagonally above and to the 
left. As in the complete sums approach, the dis- 
tance which elements will be transferred depends 
on the size relationships between r and R' and c 
and C*. The number of interprocessor transfers 
which will be required is equal to the total num- 
ber of "“psum'' terms generated minus the number of 
terms kept (which is the number of image area 
points originally in each PE). Thus, the number 
of transfers required is (rtR'-1)(ct#C'-1) - R'C'. 


In the final step of this method, the partial 
sums transferred are combined with the partial 
sums that were kept to yield the final sums ixy. 


The number of additions required to complete these 
calculations is equal to the number of partial sum 
terms that were transferred. 


Rather than implementing the partial sums 
method as three separate steps, less space is re- 
quired if the three steps for a given match posi- 
tion are executed in sequence. As_ soon as a 


R! R! 
Cc! 
Fig. 6. Partition and direction of transfer of 
elements of partial sums array. 
non-"kept" partial sum is computed, it can be 


transferred to its destination PE and saved in the 
memory Location which will eventually hold the srxy 
term of which it is a part. The execution time 
remains the same, and the only storage that is 
needed in each PE is two R'C' element arrays, one 
for the input image and one for the rxxy values. 


B. xy and sy Computation 


The partial sums algorithms for computing the 


are similar in strategy to the 
for computing the xxy values. 


sy values 
sums method 


ry and 
partial 


Each PE 
sitions 
by C! 

sums Ly and xy* 
ty and sy* algorithms. 
rithms, a C'-element vector "colsum" 


save the column sums computed so far. 
cessing of row k, “r < k <R', 


or sy” ‘terms for all match po- 
of match positions for the R' 
it contains. The partial 


computes ry 
or portions 
subimage which 


algorithms are based on the serial 


As in the serial algo- 
is used to 
After pro- 


ktr-1 
1i,j) -r<k <0 

i=0 

k+r-1 

. FGy)) 

i=k 


R'-1 
ye 2G, 
iFk 


colsum(j) O<k <R'-r 


R'-r < k < R'° 


where 0 < j < C'. Unlike the serial and complete 
sums algorithms, for each row, the leftmost sum 
consists of a single column sum, and_ (for 
~ct1 <j <0, the sum for position (i,j) is com- 
puted by adding colsum(jtc-1) to the sum for po- 
sition (i,j-1). Similarly, for C'-c<j<c', 
the sum for position (i,j) is obtained by = sub- 
tracting colsum(j-1) from the sum for position 
(i,j71). The sums (and colsums) for the topmost 
and bottommost c rows are computed in an analogous 
manner. In the "center" of each PE's subimage, 
the operations performed are identical to those in 
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is 


the topmost and leftmost edges, 


the serial and complete sums algorithms. The num- 
ber of additions performed to generate the partial 
sums in each PE will be 4R'C' + 2C'r - 2R' - 4C° - 
er + 2. As for the partial sums £xy algorithm, 
the results which must be transferred are those in 
the non-"KEEP" area in Fig. 6. Each of these ele- 
ments is added to a "kept" partial sum in the ap- 
propriate PE. The complexities of the partial 


sums sy and sy? algorithms 
column two of Tables 2 and 3. 


are summarized in 


C. SXY and RXY Computation 


As described for the complete sums method in 
subsection 4.C, for image correlation measures RXY 


or SXY, the rxy, xy, and sy” computations will be 
interleaved so that all three are computed for a 
given match position before the match area is 
moved to anew position. The computation of RXY 
is described. | 

The modifications required for the RXY computa- 


tion involve the storage for partial Cincomplete) 
IXY, LY, and sy* SUMS. In the algorithms 
described, a non-"kept” partial sum was 
transferred from the PE in which it was computed 


PE, and stored in the memory 


Location for the xrxy (or fy or zy) of which it 
was a part. For the complete RXY computation, 
space is not needed for all of a PE's local 


Ixy, Ly, and sy@ values. Provisions must there-~ 
fore be made for the incomplete sums. The pro- 
cedure will be based on the sy algorithm. It will 
be explained in terms of three cases (ranges of 
match positions). It will be shown that at most 
20¢(C'+c-1) (r-1)+c-1] Locations are required to 
hold incomplete sums at any point in time. 

Consider first a match position which is’ fully 
contained in the PE's subimage, i.e., position 
Ci,j) where O <i< R'-r and QO<j<C'-c. The 


computations can be interleaved, 


to. its destination 


2 
Ixy, ZY, and ry 


and RXY for the position can be computed. For the 
same i range, when j exceeds C'mc, (Ci.e., 
C'=c < j < C') two partial sums must be combined 
to produce~ the complete sum for each of 


xy, Ly, and sy’. After computing the Ixy partial 
sum for position (i,C'=ctk), 1 < k < c, each PE 
can compute the partial sum for position (i,-ctk) 
and transfer its value to the left, where the to- 
tal xxy sum for position (i1,C'=ctk) is then com- 
pleted. By postponing computation of Zxy for po- 
sition (i,-ctk) until it is needed, no extra space 
required for row i's incomplete Ixy values. A 
comparable savings in space cannot be realized in 


the ty and sy? computations, since the computation 
of these partial results cannot be postponed 
(without doing additional summations). Except for 


the ty (Cand sy*) 
sum for each match position is computed in terms 
of the sum for a previous match position, so (1) 
some previous results must be saved, and (2) the 
order in which the sums are computed cannot be al- 
tered or interrupted. The partial sum for posi- 
tion Ci,-ctk), 1 < k < c, must be computed and 
saved until it can be combined with the partial 
sum for position (i,C'-ctk). The same c-1  loca- 


tions can be used for each row i in the range. 
Thus, c-1 locations are needed to save partial ry 


results, and c-1 Locations for partial ny* results 
for match positions (i,j) where O < i< R'=r_= and 
-ct1 <j <0. nee 
Similarly, for partial match positions along 
the top of the PE's subimage, where the row index 
is “rtk, 1< k <r, computation of partial txy 
sums can be postponed until they are needed, but 


the partial ty and ny? sums must be computed and 


saved until the corresponding ty and sy? sums for 
row R'=rtk have been calculated. Here, separate 
Storage locations are needed for each row i, 
-r < 1 <0, and column j, -c < j <C'. Therefore, 


for each of zy and ty’, (C'+c-1) (r-1) additional 
Locations will be needed. 

The partial sums complexity for computing RXY 
1s summarized in Table 4. The time is a summation 


of that for computing <ixy, ty, and sy” for every 
match position. The transfers are for the 
non-"kept" partial sums. The space is for the 


PE's own subimage, the incomplete xy and ny* 
values which must be saved until they can be com- 
pleted, and the intermediate results in calculat- 


ing xy and sy’. 

As in the complete sums method, recursive dou- 
bling can be used to obtain the position of max- 
imum correlation over all of the PEs. 


6. CONCLUSIONS 


Tables 1 through 4 contrast quantitatively the 
complete and partial sums approaches to the opera- 
tions involved in image correlation. In order to 
more readily compare the two approaches, let 
R'=C' =I1I' andr=c=M'. The~ results” are 
shown in Table 5. 

As can be seen from the table, for each of the 


individual «xy, ty, and ny? algorithms, the com- 
plete sums approach requires more space and/or ar- 
ithmetic operations than the partial sums ap- 
proach. However, when these algorithms are inter- 
leaved to compute and locate the maximum RXY 
value, the complete sums method requires more ar- 
ithmetic operations, but fewer inter-PE transfers 
and less. space. Which method is faster will 


therefore depend on the relative time to perform. 


arithmetic operations versus transfers. For exam- 
ple, if the time to perform a transfer equals the 
time to perform a multiplication, then the com- 
plete sums method will be faster. If inter-PE 
transfers can be overlapped with arithmetic opera- 
tions, then the partial sums method will be fas- 
ter. Thus, in order to determine which approach 
will be faster on a particular system, the exact 
timings for these operations must be considered. 

The difference in the space required for _ the 
two approaches is not large. However, if the PE 
memories are small, or, for the RXY computation, 
if C' is large, the space difference may be a fac- 
tor in selecting an algorithm. 

Some basic differences resulting from the two 
algorithm strategies are evidenced in the RXY com- 
plexities. In the complete sums approach, two PEs 
hold and operate on some of the same image ele- 
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Space 


Ac-Ap Mic-Mp Tp-Tc Difference 
rxy 0 o 0 Sc-Sp= 
20M") aM! +3 
ty 81'+3M'-3 - 0 Sc-Sp= 


(mt) eq 


sy? | gr '43m'-3 21'M'=21' 0 Sc-Sp= 
+ (N°) 22M" 44 cm" 24 

RXY | 161'+6M'=6 21'M'=21! 2(21'M'=21! Sp-Sc=21'M! 
#(M9)°-2m941 CMD SHOM4T) 27 = 2M42 


Table 5: Comparison of the complete sums and partial sums approaches, 
using the following notation: 
R! = Cc! = I' 
r=cz=Mm' 
Ap = adds for partial sums approach 
Ac adds for complete sums approach 
Tp inter-PE transfers for partial sums approach 
Tc inter-PE transfers for complete sums approach 
Mp multiplies for partial sums approach 
Mc multiplies for complete sums approach 
Sp space for partial sums approach 
Sc = space for complete sums approach 


uuu wu ton 


ments. As a result, redundant arithmetic opera- 


tions are performed in the ty and ny? computa- 
tions, i.e., the sum Cor product) of the same two 
elements is sometimes performed in two PEs. These 
redundant operations are not performed in the par- 
tial sums algorithms. On the other hand, the par- 
tial sums method requires more transfers. Each 
non-local y value needed by a PE is transferred in 
only once in the complete sums approach and three 


times (Cas part of partial gy, sy’, and rxy terms) 
in the partial sums approach. 

The SIMD machine model used assumed a multi- 
stage network which can perform each required data 
transfer in a single step. Consider instead an 
SIMD machine where the PEs are connected ina 
nearest neighbor pattern, i.e., PE 1 iS connected 


to PE it1, int, itn’ /*, and i-n!/* (arithmetic mod 
N). Examples of such machines are the Illiac IV 
C43, DAP (7), CLIP4 [5], and MPP [3]. In analyz- 


ing the two algorithm approaches, the number of 


transfer steps must be increased. Assuming 
NC = NR = nile the nearest neighbor connection 
scheme requires 1 transfer step to do each of the 


PE i to PE it1, i-1, i+NC, and i-NC transfers, and 
2 transfer steps to do _ each of the PE i to PE 
i+NC+1, itNC-1, %i-NC+1, and i-NC-1 transfers. 
Furthermore, if the match area extends over more 
than two PEs, additional multiple data transfer 
steps will be needed. (Even though two transfers 
are required for some steps, typically each 
transfer in a nearest neighbor network will be 
faster than a transfer through a multistage net- 
work.) The results in Tables 1 to 5 can therefore 
be applied to nearest neighbor connected systems 
by modifying the transfer step counts as 
described. (The number of transfers for recursive 
doubling will also be increased.) 

In the SIMD machine model in section 1, it was 
assumed that each processor was associated with a 
Local memory to form a PE. Consider a different 
organization where the processors are separate 
from the memories, and the interconnection network 
is used to connect the processors to the memories. 
Inter-processor communications can be accomplished 
by writing into and reading from the’ shared 
memory. STARAN is an SIMD machine organized in 


this way (1,2). Since all memory accesses go 
through the interconnection network, there are no 
explicit inter-processor data transfers (assuming 
a network such as one of those mentioned in sec- 
tion 1 were used). Thus, with such an organiza- 
tion, the partial sums approach is faster than the 
complete sums approach. (In the STARAN machine, 
the interconnection network is not flexible enough 
to allow the processors to access the appropriate 
memories in all cases (e.g., processor i to memory 


it+NC+1). In these cases, an additional pass 
through the network will be required to align the 
data.) 

The SIMD algorithms presented demonstrate how 


SIMD parallelism can be used to reduce the execu- 
tion time of computationally intensive image pro- 
cessing tasks. For the image correlation algo- 
rithms, the asymptotic complexity for arithmetic 
operations is reduced from O(RCrc) for the serial 
algorithm to OCRCrc/N) for the N-PE parallel algo- 
rithms. The overhead of inter-PE communications 
incurred has asymptotic complexity OCC'r+R'ctro). 

In summary, SIMD algorithms to perform the 
window-based operations needed for image correla- 
tion have been explored. Two fundamental algo- 
rithm strategies were presented, and their time- 
space-transfer complexities were compared. 
Through studies and analyses such as this, more 
can be learned about both the art of parallel pro- 
gramming and the ways in which parallelism can be 
exploited in image processing. 
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Abstract -- Image processing problems frequently 
involve large structured arrays of data and a need 
for very rapid computation. Special parallel pro- 
cessing schemes have evolved over the Last 20 years 
to deal with these problems. In this paper the na- 
ture of image processing tasks are outlined and the 


parallel computer architectures which have been 
developed for these tasks are reviewed. Most of 
these special architectures may be loosely classi- 


fied as either SIMD or pipeline structures although 


some MIMD structures have been designed for high 
Level image analysis. 
In recent years several Multiple SIMD (MSIMD) 


schemes have been proposed as suitable architec 
tures for image processing. The fundamental prob- 
lems of developing an effective MSIMD system are 
discussed .and a simple SIMD/MIMD computational 
model for comparison with such systems is proposed. 


Introduction 

Image processing frequently involves very Large 
regular data structures and a need for very high 
speed computation. Through the brief history of 
digital image processing, special parallel process- 
ing architectures have been proposed and implement- 
ed, see [1] and (44). 

In this paper we will consider architectures 
which deal with images in the most conventional 
format, namely a Large two dimensional matrix of 
brightness values called pixels. This format is 
applicable to many applications. For example, in 
the industrial environment computer robot vision 
and part inspection may involve real-time video 
data as small as 256 x 256 pixels with only 6-bits 
of information for each pixel. Military applica- 
tions involving FLIR and optical video imagery may 
involve similar sized data. At the Large end of 
the scale, remotely sensed imagery from the LANDSAT 
satellite may involve images of 4000 x 4000 of 8 
bit pixels and several spectral bands of informa- 
tion. Aerial photographs may also be digitized to 
4000 x 4000 or Larger. In the biomedical area 
there are many instances of imagery, mainly from 
microscope sides, at all levels of size, color, and 
resolution. 

The nature of image processing problems may be 
divided into two characteristic classes: low Level 
image processing and image analysis. In low Level 
image processing, the output usually has a similar 
matrix size to the input. The processing may in- 
volve algorithms for restoration, noise removal, 
geometric correction or simple feature extraction 
such as edge detection, or feature enhancement. 
Such tasks are well suited to an SIMD computer 
structure and most special image processing 
hardware systems are of this form. 

The image analysis involves classifying segments 
or features of the image into known classes. This 
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may involve combining a set of segments or features 
to create a total composite object. The techniques 
involved here are usually termed pattern recogni- 
tion or artificial intelligence. For these cases, 
the image is no longer considered as a Large matrix 
of pixels. Segments of the image are represented 
by more convenient data structures such as a set of 
parametric measures or a labeled relation graphs. 
Classification algorithms frequently involve a_ set 
of sequential searches for pattern matching which 
may be conducted independently in parallel. Such 
tasks, which involve many independent operations on 
a common data base, are -well suited to a MIMD 
parallel structure. 

There has been much work at both extremes of im- 
age processing; many Low level image processing al- 
gorithms have been developed as have many pattern 
recognition techniques. However, the interface 
between these two areas is less well understood. 
This grey area involves deciding which features are 
to be extracted by the Low Level image processing 
and in what format they should be presented to the 
image analysis section. A related problem is how 
the image analysis algorithm may interrogate the 
original Low Level data to obtain further informa- 
tion when necessary. 

The areas of image processing are outlined in 
Fig. 1. SIMD processors are well suited to most 
Low Level algorithms and can also perform many sim- 
ple feature extraction operations. In some cases 
they may be effectively utilized for some statisti- 
cal classification techniques. MIMD processors are 
not very efficiently organized for low Level image 
processing since much hardware is devoted to indi- 
vidual control units and reliable asynchronous data 
communication between processor units. There is 
also a problem with sharing near neighbor data 
between the processor units. Some feature extrac- 
tion algorithms, especially those serial in nature, 


such as contour following, are well suited to the 
MIMD structure. Classification schemes especially 
those involving syntactic methods are very well 


suited to the MIMD structure 

An initial problem in designing a_ complete 
parallel image processing system is to decide which 
basic machine architecture to use: SIMD, or MIMD. 
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Fig. 1 The main stages of Image Processing. 
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Some researchers have proposed a combined MSIMD 
system as a possible solution. These schemes en- 
able a group of independent SIMD processors to be 
assigned to a task. One problem in designing such 


systems is to ensure’ that the worst features of. 


both systems, i.e., the expense and inefficiency of 
the MIMD data communication and control, and the 
inflexibility of the homogeneous SIMD structure are 
not both present in the combination. In this paper 
a simpler computational model is proposed which may 
be used as a benchmark for the efficiency of more 
complex designs. | 

Historically, a fundamental bottleneck in pro- 
cessing capability has been perceived with the low 
Level image processing task. Many SIMD and pipe- 
Line architectures have been developed for low lev- 
el image processing applications, whereas special 
architectures for high Level image analysis have 


received less attention. Most of the low Level ar- | 


chitectures have the characteristic that a single 
operation is automatically applied to all elements 
of the image matrix in a fixed amount of time. 
These architectures may be divided into three 
types: (a) Parallel Binary Array Processor (b) 
pipeline processor and (c) special function units. 

In the following sections of this paper the ar- 
chitectures which have been developed for low Level 
image processing are reviewed with emphasis on more 
recent designs. Then the features of proposed 
MSIMD schemes for image processing applications 
will be discussed. 


Parallel Binary Array Processors 

Binary Array Processors [11] operate in the sin- 
gle instruction stream-multiple data stream (SIMD) 
mode with a matrix of identical processing elements 
(PE's). The whole image or a consecutive block of 
the image is distributed through the PE's and pro- 
cessed in parallel. ALL data paths within a PE are 
only one bit wide and each PE is connected to PE's 
adjacent to it. 

The main features of the BAP scheme result from 
the bit-serial architecture of the PE's and near- 
neighbor interconnection scheme. The bit-serial 
architecture allows flexible data formats and makes 
the BAP very efficient with respect to memory and 
processing resource utilization. Many image pro- 
cessing algorithms require that data within § Local 
areas of each pixel is to be combined; the near 
neighbor interconnection scheme enables these algo- 
rithms to be efficiently implemented. 

In 1959 Unger [6,7] proposed a parallel BAP 
machine with a matrix of PE's for image processing 
applications; many of Unger's ideas have been in- 
corporated into Later BAP designs. One of the 
first hardware BAP systems was the SOLOMON computer 
C8] which was built by Slotnick at Westinghouse. 
Another Landmark parallel BAP was Illiac III C9] on 


which development started in 1963 by McCormick but 


which was never’ completed. Early parallel BAP 
development was hampered by the very high cost of 
the parallel hardware. Both SOLOMON and Illiac III 
were designed to have a 32 x 32 PE matrix size. 

With the advent of large scale integration the 
construction of much Larger BAP's has become feasi- 
ble. Since the Illiac III design a sequence of 
BAPs called CLIP have been developed by Duff [10]. 
The most recent version, CLIP4, is an operational, 
LSI, 96 x 96 parallel BAP. 
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The features of current BAP designs will be dis- 
cussed by describing two diverse architectures. 
These are: the BASE system which is being developed 
in a small prototype form at Purdue University 
(11,121 and the MPP [16]. 

The BASE PE is shown in Fig. 2. It consists of 


= Boolean. : 


Processor 
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- Function 


Fig. 2 The BASE PE Organization 


three main components: a Boolean processor which 
can implement any three input Boolean function, a 
1-bit wide local memory and a Near Neighbor func- 
tion processor (NNF),. In general, the operands A, 
B and C would be held in 1-bit registers which 
could receive data from the Local memory on a_ com- 
mon bus Line. 

The NNF receives data from the 8 near neighbor 
PE's as shown in Fig. 3 and realizes the following 


function 
F= V (9. N N.) 


Where {g,.--do4 is an 8-bit control vector. A hex- 


agonal near neighborhood may also be selected. 
There are three fundamental instruction types 
for BAP's [11]: Boolean, simple near neighbor and 
recursive near neighbor. Boolean instructions are 
used for logical and arithmetic operations within 
the Local memory; the NNF is not used. The BASE 


Fig. 3 Data interconnections between a BASE PE 
and its 8 adjacent PE's 


Boolean processor can implement any of the 256 
‘three input Boolean functions as specified by an 
8-bit control vector. A 2-input Boolean function 
could be adequate for Boolean instructions; howev- 
er, the 3-input function is much more efficient for 
arithmetic operations and is also necessary for re-~ 
cursive near neighbor instructions. Arithmetic may 
be achieved with conventional bit-serial algorithms 
or, in some cases, functions may be more efficient- 
Ly implemented with a specially optimized instruc- 
tion sequence [13]. 

Simple near neighbor instructions 
one operand comes from a 
selected subset of near neighbors. 
of near neighbors is useful in some binary image 
algorithms. For example, the perimeters of all ob- 
jects in a binary image may be obtained with one of 
these instructions. 

Recursive instructions are implemented on only a 
few BAP's. In this instruction the near neighbor 
output is taken from R rather than A in Fig. 1. A 
signal may propagate through a connected sequence 
of PE's in a single recursive instruction. This 
asynchronous operation may be several times faster 
than an equivalent sequence of simple near neighbor 
instructions depending upon the technology and de- 
tailed design of the processor. Recursive near 
neighbor instructions are useful for horizontal ar- 
ithmetic, which treats the rows of the PE matrix as 
a set of data items, and some two dimensional 
binary topology operations. 

A prototype BASE system, 
currently being developed at Purdue University 
C12]. It is constructed with Schottky TTL and in- 
volves 64 PE's arranged in an 8 x 8 matrix. The PE 
matrix can automatically scan a Larger matrix size 
processing it in consecutive 8 x 8 blocks. 

A general block diagram for a Large scale BAP 
system is shown in Fig. 4. The data processing is 
achieved by the array of PE's which simultaneously 
process a submatrix of an image. The total image 
is processed as a sequence of these submatrices. 
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Fig. 4 Binary Array Processor System 
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Data is input to and output from the PE array via 
the 1/0 buffer memory which communicates the data 
to image peripherals and conventional computer bulk 
storage devices. With the current LSI systems each 
bit plane is input along one edge of the PE array 
one column on each clock cycle. Each row of the PE 
array acts as a shift register. When the complete 
bit plane has been input it is stored in the Local 
memory in one clock cycle. When input data is in 
the form of a stream of pixels it must be converted 
to the bit-plane format. Instructions to the PE 
array are issued by a single high speed micropro- 
grammed controller. The whole system synchroniza- 
tion is maintained by a conventional host computer 
which issues macro instructions to the controller. 
Some feature information may be extracted from the 
PE array by the global information extraction 
mechanism. 

The most usual mechanism is an OR function over 
all PE's which indicates if any PE has a one ele- 
ment. This is useful for terminating loops and for 
detecting the presence of objects. For applica- 
tions which require more feature extraction, such 
as shape analysis, a more powerful scheme is desir- 
able. One scheme is to count the number of bits 
set in the bit plane which is implemented on BASE-8 
and could be efficiently implemented on LSI _ BAP 
systems [14]. 

The MPP was designed by NASA [15] with the pri- 
mary function of analyzing LANDSAT satellite data. 
The system has been redesigned by Goodyear 
Aerospace [£16] who are contracted to build a 
hardware MPP by mid 1982. The PE array consists of 
16384 PE's organized in a 128 x 128 matrix. A spe- 
cial LSI CMOS/SOS chip has been designed which con- 
tains 8 PE's without their Local memory ina 2 x 4 
submatrix. In the initial version each PE will 
have 1024 bits of local memory; however, the design 


includes provisions for up to 16k bits per PE for 
when such memory chips become available. The clock 
cycle time for the array is 100 ns. 

A simplified MPP PE is shown in Fig. 5. The em- 


phasis with this design is fast arithmetic computa- 
tion rather than binary near neighbor operations. 
The NN select unit enables a bit plane to be shift- 
ed in one of the four cardinal directions in_ one 
instruction. The Boolean processor implements all 
16 possible Boolean functions between the P regis- 
ter and the value on the data bus; in this case P 
is an accumulator. For arithmetic operations a 
dynamically reconfigurable, variable Length shift 
register with a maximum Length of 30-bits and a 
full adder are available. This organization is 
faster than the BASE scheme especially where multi- 
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Fig. 5 Simplified MPP PE Organization 


plication is involved; in this case the shift re- 
gister is used for circulating the partial product. 

The MPP is very fast for 8-bit pixel operations; 
it can execute 6.5 billion additions per second or 
1.8 billion multiplications per second. For 32-bit 
floating point data the MPP can execute 430 million 
additions per second or 210 million multiplications 
per second. The integer addition is optimal for 
the given processor memory bandwidth, the multipli- 
cation could be made faster. It is anticipated 
that with VLSI technology faster multiplication and 
other processing functions will be included in PE 
designs and PE arrays will become Larger (17,18]. 
A special programming Language called Parallel Pas- 
cal has been developed for the MPP and other BAP's 
C19]. A compiler for this Language is currently 
being developed and a translator is available which 
allows Parallel Pascal programs to be run on con- 
ventional Pascal systems. 

Other parallel bit serial processors have also 
been used for image processing. The STARAN associ- 
ative processor [41] has been used for LANDSAT im- 
age analysis [42]. It can perform similar opera- 
tions to a BAP except that it can only process a 
row or a column of the image at a time in parallel. 
The distributed array processor (DAP) [40] is a BAP 
implemented with a 64 x64 PE matrix which was not 
designed specifically for image processing. A LSI 
chip has been developed for it which contains 4 
PE's without Local memory and is based on the un- 
committed Logic array (ULA) approach. Low level 
image processing tasks have been smpReMented on the 
DAP (39). 


Pipeline Processors 
The basic organization of a pipeline 
for image processing is shown in Fig. 6. 
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Fig. 6 Computer Organization for Pipelined 
Image Processing 


data is passed in raster scan format from the 
buffer memory into a pipeline of processing stages. 
The function of each stage is specified by the host 
computer through the instruction bus. Once set up, 
a processing stage performs the same operation on 
every element of the data sequentially. There is 
an initial set-up delay while data is input to the 
pipeline before the first result appears at the 
Last stage then a result is generated with each 
clock cycle. Therefore completely processing an 


n x n image requires the set-up time plus n@ clock 
cycles. 

One main advantage of this architecture is that 
no input data reformatting is necessary. In fact 
the input to the first stage could be taken direct- 
ly from the output of a TV camera. Other advan- 
tages are the very simple data interconnections 
between processing stages and, since an instruction 
resides in a stage for many clock cycles, a _ high 


speed control unit is not required. 
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A pipeline stage can implement near neighbor 
processing functions by means of two shift regis- 
ters as shown in the near neighbor processing unit 
(NNPU) in Figs 7. Raster formatted data is input 
at register N, and is then shifted through the 


shift registers 
jt reaches Noe 


and near neighbor registers until 
To process an image with a row of n 


elements the shift registers are set to Length n-3. 
Then an input data element will be at register P 
after nt+1 clock cycles and its near neighbor values 
will be available in registers NA° °°Ng. A small 


amount of additional logic is required to maintain 
the correct values of near neighbors at the edge of 
the image. A simple near neighbor function may be 
implemented by a near neighbor function unit con- 
nected to the nine registers as shown in Fig. 7. 
The set-up time for this NNPU is nt2 clock cycles. 


a 1s | 
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Fig. 7 Organization for a pipelined near 
neighbor processing unit (NNPU) | 
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One of the most ambitious pipeline image proces- 
sors which embodies these concepts is the cytocom- 
puter developed by Sternberg at the Environmental 
Research Institute of Michigan (20). Currently a 
prototype cytocomputer is operational which in- 
volves two pipelines: one contains 80 binary 
stages and the other contains 25 grey Level stages. 
The data connections between the stages are 8 bits 
wide. The binary stage can compute a binary func- 
tion on the near neighbors of one of the inputs and 
combine the result of this with the other 7 bits. 


ALL 2? near neighbor functions are implemented by 
means of a table-lLook-up memory which generates 8 
outputs. A select unit selects 8 bits from the 
8-bit input data and the 8-bit near neighbor func- 
tion and these are mapped by means of a 256-word 
8-bit table-look-up memory to the 8 outputs for the 
stage. For the grey level stages the architecture 
is similar to the NNPU; in this case all shift re- 
gisters are 8-bits wide and the near neighbor func- 
tion involves 8-bit arithmetic operations. The | 


8-outputs are also mapped through a 256-word 
tab Le- look-up memory. 
An LSI version of the cytocomputer is currently 


being developed. It will consist of up to 550 
stages; each stage is capable of both binary and 
gray level functions. A special LSI CMOS/SOS chip 
is being developed which will contain one cytocom- 
puter stage. 

Eskenazi and Wilf have developed a simple pipe- 
Lined processor system for real-time image analysis 
at the Jet Propulsion Laboratories [22]. The sys- 
tem is designed to produce partial object boun- 
daries from raw image data and involves three dif- 
ferent processing stages, each of which operates on 
a 3 x 3 Local neighborhood. 


The cytocomputer structure is an effective 
mechanism for very low level image processing 
(e.g., image filtering operations). Lougheed and 
McCubbrey have made a comparison between the cyto- 
computer and other parallel structures for some 
very low Level image processing tasks [21]. Howev- 
er, it is not clear how such a structure can easily 
combine more than one image in an operation or per- 
form functions such as geometric correction. 

These Limitations are due to the Limited 8-bit 
Linear interconnection scheme of the pipeline. It 
would be possible to have a more complex pipeline 
involving stages that could deal with more than one 
8-bit input. However, the cost of implementing a 
more flexible interconnection scheme would be very 
high. 

The FLIP system [23] consists of 16 special pro- 
cessors with two input ports and one output port 
(they do not contain a 3 x 3 local window function 
Like the cytocomputer). These may be reconfigured 
into any interconnection arrangement by software. 
Interconnections are achieved by 16 interprocessor 
buses and 16 buses from the I/O control unit = and 
buffer store. Each processor is capable of execut- 
ing a simple sequential algorithm on each input 
pixel; it contains 256 words of program store and 
50 bytes of data store. The processor intercommun- 
ication is achieved by asynchronous handshaking, so 
that the whole system will function correctly when 
the processors require different processing times 
for each pixel. The multiple input ports from the 


buffer memory enables a pixel and selected near 
neighbor values to be input to the processors 
simultaneously. 


Special Function Units 
The organization of the third architecture type, 
special function units, is illustrated in Fig. 8. 
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Fig. 8 Special Function Processor System 


Each special function unit (SFU) is a direct 
hardware implementation of an image processing al- 
gorithm, or in some cases it may implement a set of 
related functions. SFU's may contain some Local 
memory and program control. A system usually con- 
sists of a set of SFU's and a fast image memory 
connected together by a high speed bus. The image 
data is processed serially by a SFU, which may be 
pipelined, and image results are returned to the 
fast memory. | | 

Early near neighbor processors were based on 
this scheme [2-5]. For some current medium speed 
SFU systems see also [26,37,43,44]. Current sys- 
tems usually involve bit-parallel data and may in- 
volve SFU's which process several pixels simul tane- 
ously. : 

An important image processing operation is  con- 
volution with a small predefined matrix. Many im- 
age processing functions such as convolution and 
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the DFT may be defined by an inner product opera 
tion. The inner product computer (IPC) [24-25] is 
a special function unit for rapidly computing the 
inner product operation. For two vectors A and B 
the inner product is defined by 


Fig. 9 An inner product computer design 


involves n multipliers to compute the vector ele- 
ment products and a tree of n-1 adders to sum_ the 
results. In practice the design may be pipelined 
to achieve very high speed; moreover the logic is 
simple to design and lay out. The bandwidth of the 


system may be exactly matched for integer operands. 
In the example, 2 4-element 8-bit vectors result in 
an inner product of 18 bits. 

A possible organization for an _ IPC 
Weights are 


system is 


shown in Fig. 10. stored in the 
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Fig. 10 An IPC computer organization 
for Image Processing 


memories M,°°°M and data is shifted from the data 
store into the shift register S4°°*S, 


To convolve on n-element kernel function with an 
image the n-kernel values are first loaded into the 
n-memories then the image data is shifted into the 
shift register S. After the set-up time for the 
IPC a new data item will be input and a result will 
be output with each clock cycle. If a two dimen~ 
sional convolution is required then delay shift. re- 
gisters could be added to S as shown in Fig. 7. 
The set-up time would be increased but the follow- 
ing execution time would still be equal to the num- 
ber of results. 

General transform operations such as the 
Discrete Fourier Transform (DFT) may be achieved by 
Loading the M memories with all the transform coef- 


and the S registers with the data to be 
transformed. The addresses of the memories are 
stepped through sequentially and a result is gen- 
erated with each clock pulse after the set-up time 
of the IPC. A complex DFT can be achieved on an 
IPC for real numbers with four inner product opera- 
tions C241]. 

The parallel pattern processor (PPP) [26] in- 
volves an 8 input IPC with buffer memory for seven 
Lines which can implement an 8 x 8 convolution 
function. The Dynamic Spatial Reconstructor [24] 
for very high speed tomographic reconstructions in- 


ficients 


volves avery fast IPC based on sub-nanosecond ECL 
technology. 
In some cases the SPU may contain several SIMD 


PE's which operate on different pixels (1,5,43] For 
example, the FIP processor in the PICAP II system 
C43] contains 4 programmable PE's which can execute 
a small convolution or a programmed sequence of 
near neighbor operations on 4 pixels simul taneous- 
Ly. Several rows of the image are rapidly accessi- 
ble in a Local buffer memory. 

Initial research has been done on 
image processing functions with charge coupled dev- 
ices (CCD's). © This technology processes unique 
features such as: Low power delay product (0.1 pico 
Joules), very high packing density, and simple ana- 
log signal processing implementation. Moreover, 
Since image sensors can be made with CCD's the pos- 
sibility of including some low Level processing on 
the same substrate as the sensor exists (smart sen- 
sors) . . 

Several test chips have been developed for com- 
puting local functions such as edge detection, con- 
volution filtering, median filtering and adaptive 
thresholding (30-32). These test chips require 
data to be input in analog form from external CCD 
shift registers. Real-time processing of a video 
signal (7.5 MHz) is possible; however only about 
6-bits of precision can be maintained with the ana- 
log processing. 

For high speed parallel processing a sind (fo- 
cal plane) processing concept has been proposed. 
In this scheme, data from a CCD matrix is shifted 
into a set of processors connected to one edge of 
the matrix. The outputs of the processors are con- 
nected to one edge of a result matrix into which 
the results are shifted. These proposals include a 
single substrate design for simple processing func- 
tions £31] and a multi-chip system for more complex 
processing [32]. 

CCDs have also been proposed for on-board Satel- 
Lite classification of LANDSAT remotely sensed data 
C33]. This proposal is based on the concept of a 
CCD IPC device. 


Characteristics of Low Level Architectures 

In summary, some of the features of the current 
parallel architectures for image processing are as 
follows. For very high performance the BAP ap- 
proach may be used; the parallelism may be in- 
creased until there are as many PE's as pixels 
without any fundamental problems... Once there are 
as many PE's as pixels, the PE's with bit-parallel 
rather than bit-serial operations may be considered 
for even higher performance. For very Low Level 
image processing the pipeline approach may offer a 
simpler solution than the BAP system. Both BAP and 
pipeline systems are very suitable for VLSI imple- 


implementing 


memory resources. 


mentation since they involve .a large number of 
identical modules. Furthermore, all chip intercon- 
nections are very short. Therefore, no. scaling 
problems should be experienced when the number of 
processing units is increased. For medium speed 
applications the special function approach is pos- 
sible. Higher speed may be achieved by using a 
wider high speed bus and redesigning the functional 
units to have several SIMD PE's. Some current sys- 
tems would be difficult to extend in this way. 
Scaling problems may be experienced when the size 
of a system is expanded since the Length of the bus 
may need to be extended and data interconnections 
may be more complex. In certain applications ei- 
ther very high speed Logic or CCD technology may be 
appropriate. 

Most current pipeline and special function units 
are based on 8 bit-wide data paths. This precision 
is adequate for representing most image data, but 


partial results may easily require more. informa- 
tion. The BAP approach is not Limited to a fixed 
pixel size and can easily deal with floating point 
data. 7 | 


MIMD and MSIMD schemes 

There has been much interest in recent years’ in 
developing multiple instruction stream--multiple 
data stream (MIMD) parallel processors. Such a 
computer may be constructed with a set of indepen- 
dent processors executing different programs, which 
Can communicate with each other and share some 
An organization for this type of 
‘is shown in Fig. 11. The processors usu- 
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Fig. 11 A MIMD Processing Organization 
ally have some Local memory but also have access to 
a common shared memory. The interconnection net- 
work which connects the processors to the shared 
memory presents a significant design problem espe- 
cially when a Large number of processors are in- 
volved. A crossbar switch scheme would allow any 
processor to access any shared memory module. How- 
ever, it is usually too costly to implement. A 
Limited permutation network is usually more practi- 
cal. However, significant delays in accessing data 
from the shared memory may occur due to the’ Longer 
path length through the network and blocking. The 
availability of cheap microprocessors has made the 
construction of these computers feasible. However, 
fundamental research problems still exist as to the 
best methods to interconnect the processors and 
dynamically distribute the processing tasks between 
them. 

The main advantage of the MIMD scheme over the 
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SIMD scheme occurs with high Level image under- 
standing tasks. For example, it would be possible 
to assign a set of processors to analyze a set of 
segmented objects where each processor deals with a 
single object. Alternatively several processors 
could be set to analyze a single object where each 
processor may realize a different analysis algo- 
rithm. Low level near neighbor algorithms are usu- 
ally more easily implemented with the SIMD scheme. 


There have been proposals for MIMD and MSIMD ar- . 


chitectures which are intended for image processing 
applications ([27,28,29,35,36]. Two microprocessor 
based MIMD designs are currently being researched 
at Purdue University. One scheme, the Partition- 
able SIMD/MIMD system (PASM) [27,28] involves 1024 
microprocessors and 16 control units. The proces- 
sors are interconnected with a Limited, synchronous 
interconnection network. PASM may be dynamically 
reconfigured into a set of different size, indepen- 
dent SIMD systems to suit the requirements of a 
processing task. Up to 16 independent SIMD systems 
may be possible at one time, each controlled by one 
control unit. The number of processors allocated 
to a control unit may be dynamically varied; at one 
extreme a single control unit may control all 1024 
processors in which case it behaves Like a_ single 
SIMD system. There is no general shared memory in 
the PASM system, rather data is shared through in- 
terprocessor data connectors. 

In the second scheme, the Purdue multi-mode mul- 
timicroprocessor [29] about 128 microprocessor are 
considered. Each microprocessor has its own con- 
trol unit and local memory so it is a true MIMD 
system. In early designs an interprocessor mechan- 
ism for synchronizing a set of processors in an 
SIMD mode was proposed however this is not con- 
sidered important in the current design. Block 
data transfers are made between the Local processor 
memories and the shared memory; most processor data 
accesses are made to their Local memory. With this 
scheme a simple processor interconnection network 
can be used and the delay in establishing a path 
from a processor to a shared memory module is 
offset by transferring more than one data word. In 
this sense, the processor Local memories may be 
considered to be more Like cache memories. 

In both the above systems high speed low Level 
image processing is achieved by distributing the 
task amongst as many available processing units as 
possible. The hardware utilization for these tasks 
is not as good as the special architectures previ- 
ously discussed since only one control unit is 
necessary and multiple control units are a_ feature 
of an MIMD system. 

The following simple SIMD/MIMD model is proposed 
which may offer better hardware utilization. This 
simple model is outlined in Fig. 12. A single SIMD 
processor is used for low level image processing 
and a separate MIMD processor is used for image 
analysis. The size of each processor would be ini- 
tially configured to be adequate for the anticipat- 
ed work load. In large applications, or when high 
reliability is necessary a loosely coupled network 
of several SIMD and MIMD processors could be used. 

In the general scheme shown in Fig. 12, no 
specific architecture details are specified such as 
the nature of the interconnection network or method 
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Fig. 12 


of 1/0 interfacing. The host of the MIMD system 
may be simply an I/O channel if there is a distri- 
buted operating system. The main advantage of the 
general structure is that the best features of both 
the SIMD and MIMD systems are easily attained. 

The data flow for Low level image processing and 
image analysis are frequently quite different. In 
the proposed system the SIMD processor would be op- 
timized for low Level image processing and the MIMD 
system would be optimized for image analysis. The 
organization of the MIMD system would be much 
simpler than the MSIMD system since it would not 
have to deal with the SIMD data flow. SFU's may be 
added to the MIMD processors to assist with well 
defined image analysis tasks. 

There are two possible cases when the MSIMD sys- 
tem might have an advantage over the proposed sys~ 
tem: (a) if the ratio of Low Level image process- 
ing to image analysis was very unpredictable and 
(b) if the SIMD Low Level image processing re- 
quired the processing of many, very small, sub im- 
ages. The ratio of processing types in case (a) 
would have to have a very great variability before 
the extra expense of the MSIMD system would be war- 
ranted. For case (b), Large SIMD systems may be 
partitioned by means of masks so that an identical 


algorithm may be executed simultaneously on a set 
of sub images. Therefore, a reduction in efficien- 
cy of the single SIMD system would only occur if 


the algorithm for the sub images were very dif- 
ferent from each other. There would be very Little 
reduction in efficiency if the Low Level image pro- 
cessing was realized with a pipeline structure. 


Conclusions 
Image Processing tasks may be divided into two 
areas: Low Level image processing and image 
analysis. Historically,the low Level image pro- 
cessing problem has received the most attention and 
several viable parallel computer architectures have 
been developed for this task. Some of these archi- 


tectures have been reviewed and their characteris- 
tics are summarized in section V. Current parallel 
architectures are very efficient for implementing 


current Low Level image processing algorithms and 


‘they are suitable for VLSI implementation. 


several MSIMD architectures have been proposed 
in the Literature for image processing. A 
SIMD/MIMD system has been proposed in this paper 
which is simpler to organize than the MSIMD scheme 
and will be more efficient for well defined image 
processing tasks. 
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Abstract -- This paper discusses the 
application of systolic array processors to 
Signal processing problems that are amenable to 
a matrix formulation. Systolic arrays are 
formed by providing nearest-neighbor inter- 
connections between a large number of elemental 
processors to form either a one - or two-dimen- 
sional array. With the possible exception of 
boundary elements each processing element per- 
forms identical computations in synchronism 
with other elements in the array. A number of 
important problems for which systolic arrays 
hold potential are mentioned and the systolic 
array processor definition, in a number of its 
forms, is reviewed. When applied to strongly 
band-limited matrices, systolic array 
processors can be characterized as highly 
efficient from the standpoint of both hardware 
utilization and algorithm time. However, as 
the bandwidth becomes large this high per- 
formance is degraded. In an effort to overcome 
performance degradation, this paper introduces 
and evaluates a data transformation which, when 
applied to an n x n dense matrix, results in an 
improved banded structure with attendant hard- 
ware savings. An interesting feature of this 
transform is its invariance properties with 
respect to the ordering of output. time 
sequences and algorithm execution time. 

Another interesting aspect is its relation to 
the classical Gauss-Seidel's method of 
iteration. 

It is shown that systolic array 
processors possess some efficient testability 
features which can be exploited concurrently. 
These are briefly summarized. 


1.0 Introduction 


This paper discusses the application of 
systolic array architectures to signal 
processing problems. 

Introduced by Kung [1], systolic array 
architectures provide the capability for 
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realizing a number of important matrix 
operations. In addition to achieving a high 
computation rate by means of pipelining and 
concurrent computation, the architecture is a 
good candidate for implementation with VLSI 
(very large scale integration) technology. If 
the matrices processed are characterized by a 
narrow bandwidth, excellent hardware 
utilization efficiency can be achieved. 
However, in those cases where the matrix 
bandwidth becomes appreciable, for instance in 
the case of square densely-populated matrices, 
hardware utilization efficiency is degraded 
significantly. This paper addresses the 
problem of using systolic arrays to process 
matrices whose structure is less constrained. 
A simple but effective data transform which can 
in some instances significantly improve 
hardware utilization efficiency is introduced 
and developed. 

The paper is organized as follows. 
Section 2.0 presents a brief and general dis- 
cussion of several problem areas where the 
systolic array architecture is of interest. 
Section 3.0 outlines the main features of the 
systolic array architecture and only summarizes 
the extensive treatment given in [1,2]; this 
section is included only for purposes of 
completeness of presentation. The PRT (partial 
row translation) data transform is introduced 
and developed in detail in Section 4.0. 
Section 4.0 also quantitatively compares the 
efficiency of the original systolic array 
processor with that which results from applying 
the PRT transform. These results provide a 
means for deciding when PRT is advantageous. 
Matrix inversion is the topic of Section 5.0 
while Section 6.0 briefly outlines an efficient 
technique that is useful for testing some 
systolic array matrix processors. 


2.0 Matrix Operations in Signal Processing 
Applications 


Matrix operations represent a significant 
portion of the computational burden encountered 
in many signal processing applications. 
Adaptive filtering, data compression, beam- 
forming, and cross-ambiguity calculation 
represent problem areas where stable matrix 
analysis techniques are of current interest. 

In terms of resources required for system 


implementation, these problems can be classi-. 
fied as memory intensive and computation 
intensive. Construction of systems capable of 
providing the computations required for 
analysis of the above problems must provide for 
such operations as matrix multiplication, 
inversion, addition and various decompositions. 

For example, in least squares approxima- 
tion problems, one might encounter matrix 
multiplication, matrix inversion, and/or 
Singular value decomposition. The computa- 
tional approach used in a particular instance 
depends upon the numerical stability properties 
of the problem at hand. For instance, if the 
order of a particular problem is sufficiently 
small, the Gauss normal equations might be 
solved by performing a straightforward matrix 
inversion. However, in the solution of 
ill-conditioned systems commonly encountered in 
large-scale problems, achieving a meaningful 
solution might require application of singular 
value decomposition computations. 

In [3] Speiser and Whitehouse discussed 
the signal processing problems mentioned above 
and considered the applicability of competing 
architectures such as transversal filters, 
array processors, bus-organized multiprocessors 
and systolic array architectures. Of these, 
the most promising architecture is that of the 
systolic array which has the potential to 
support real-time implementation of the 
algorithms required in order to address those 
problem areas mentioned in this section. 


3.0 The Systolic Array Architecture 


In the interest of a self-contained 
presentation, the systolic array architecture 
will be outlined and illustrated in this 
section. A thorough, comprehensive treatment 
can be found in [1] or in [2]. The systolic 
array architecture is founded almost exclu- 
sively upon a single computational element--the 
inner product step processor--which implements 
the relation 


k+l | k 
y e x 


e 


| k+l 7 ¥ > 
k=0, 1, 2, ..., nl. (1) 


Systolic array processors are constructed by 
appropriately interconnecting a group of inner 
product step processors. In the systolic array 
architecture, only nearest-neighbor processor 
communication is permitted. For purposes of 
data communication and computation, each inner 
product step processor is equipped with three 
data registers: a (for y), Rg (for ax) 

and Ry (for x,). Each register has two 
connections ~- one for input, the other for 
output. Kung [1] defined two types of inner 
product step processors which are illustrated 
in Fig. 1. These elemental processors can be 
connected in a number of ways which provide the 
capability to perform various matrix operations 
such as matrix multiplication, L-U 
decomposition of symmetric positive-definite 
matrices, and the solution of triangular linear 
systems of equations. 
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| A basic unit of time measure for both 
types of processors shown in Fig. 1 is defined 
as follows: (a) the processor loads inputs 
yk, xX, and a,, into Ry, R,, and Rg 
respectively, (b) yet is computed 
according to (1), and (c) y k+l x, and 
a, are output. 

As an example, a Soptotic array 

matrix-vector processor will be configured to 
form the product 


y = Ax 7 (2) 
using a linearly connected group of Type 1 


processors. The relations which must be 
implemented are as follows 


0 _ 
y 50: | (3) 
y, * y3 tS de 2. eee 


Fig. 2 illustrates the systolic array of 
processors, the element data arrangements and 
flow required to evaluate (2) for the case 
where A is an n x n matrix with bandwidth 

=p+q- 14. Definition of p and q are 
as follows: | 


p = max(j-itl), #0 for 


j?i 
q = max(i-j+l), aij # 0 for oe 


The yj enter the array from the right as zero 
and accumulate so as to form the inner product 
of the ith row of A with vector x which moves 
to the right after being input from the left. 
As the x and y vectors move through the array 
in the manner noted, A is shifted downward such 
that elements along the main diagonal pass 
through Pg. In general elements of A above 
and parallel to the main diagonal pass through 
processors to the left of P9. Similarly | 
elements of A below and parallel to the main 
diagonal pass through processors to the right 
of Po. A detailed example illustrating the 
operation of this systolic array matrix-vector 
processor will be presented in Section 4.0. 
Generalization of the linearly-connected 
systolic array to a two-dimensional 
orthogonally-connected structure enables the 
evaluation of matrix-matrix products. A 
systolic array for evaluating es 


C = AB (4) 


where all matrices are n x n is shown in 

Fig. 3. Matrix A is input to the systolic 
array in exactly the same way as described 
earlier for the matrix-vector processor while 
columns of B are input, with appropriate 
spatial shift to allow for A's time delay, into 
successive rows of the array. If B contains a 
large number of columns this implementation can 
be inefficient even for stronly banded 


matrices. Kung [1] overcame this problem by 
devising the hexagonal-connected systolic array 
which is based upon the type 2 processor of 
Fig. 1. An example of this processor is 


presented in Fig. 3(b) for the case (4) when A, 


B and C are strongly banded. Note the 
direction of flow orientation of A, B and C. 
Entries in C are accumulated as it is shifted 
upward from the bottom of the array, where the 
Cjj enter with zero value. 

Using the array structures presented 
above, Kung [1] was able to realize two addi- 
tional important matrix operations. Due to 
space limitations, these only will be mentioned 
here. A triangle equation solver can be con- 
structed using a linearly connected array of 
inner product step processors; however, it is 
necessary to introduce a new processor capable 
of division. The resulting processor solves a 
nonsingular triangular system of linear 
equations by back-substitution. Similarly, by 
adding special elements on the upper portion of 
the periphery of the hexagonal array (Fig. 3b), 
Kung [1] showed that one can obtain the 
following matrix decomposition 


A = LU 


where A is a symmetric, positive definite 
matrix, 
L is lower triangular having ls on the 
main diagonal, and 
U is upper triangular. 
Therefore, this processor, when coupled with 
the triangle equation solver, can be used to 
solve a fairly general class of simultaneous 
equations. 

Table 1 summarizes the hardware require- 
ments and algorithm execution time steps for 
the family of systolic array processors defined 
by Kung. When considered from the standpoint 
of hardware uniformity, a surprising degree of 
capability is realized by the systolic array 
architecture. For the case of strongly banded 
matrix structures, this architecture is 
efficient in terms of both the quantity of 
hardware used and in hardware utilization 
efficiency. However, if square dense matrices 
or matrices of more general structure are con- 
sidered, hardware utilization efficiency can be 
degraded considerably. This problem is 
addressed in the remainder of this paper where 
methods for improving implementation efficiency 
are introduced and studied. 


4.0 Definition and Development of the PRT 
Transform 


In this section the PRT (partial row 
translation) transform will be defined and some 
of the benefits available from its application 
in connection with systolic arrays will be 
presented. It will be shown to improve hard- 
ware utilization efficiency and in addition 
provide a hardware savings in the case of 
Square dense matrices. 


Definition of the PRT Transform 


Consider the matrix-vector multiplication 
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problem stated in (2) with A constrained to be 
n x n and densely populated. Express A as a 
strictly subdiagonal part, Ay, (i.e. with no 
diagonal elements) juxtaposed with Ay, the 
upper triangular part of A which contains the 
main diagonal elements of A. This may be 
expressed as follows 

Me 


_ 


A = 
\. 
eL (5) 
Applying the PRT transform to (5) provides 
Wa he 
~ be 0 
A me ‘. 
PRT <a 
N 
: s ALS (6) 
That is, Appr 1s obtained from A simply be 
translating (i-1) elements in row i to the 
right n positions within the row for 1 = 2, 3, 


, n. In the resulting n e (2n - 1) apeay. 
aii elements not specified by Ay and the 
displaced Aj are set to zero. Now, applying 
the PRT transform to (2) yields, the equivalent 
expression 


y=A x =A ES 
PRT PRT PRT Ky (7) 
where x, = (xj, x9, «++, X,-])- It 
is noved that the PRT converts a square array 


into a nonsquare array with enhanced banded 
structure. The transform necessitates 
augmenting x with a partial copy, x,. A 
detailed example where A is 4 e 4 is detailed 
in Fig. 4. Four processors are used and the 
required number of time steps is eleven. These 
quantities compare favorably with Kung's 
systolic array which would use seven processors 
and also eleven time steps. For n large, it 
follows that the PRT transform saves about n/2 
inner product step processors with no increase 
in execution time. If the original systolic 
array were designed such that immediately upon 
processing element a,,, the values of y 
contained in the array could be unloaded, a | 
time advantage would result for this processor 
configuration. The corresponding PRT based 
array, while saving about one-half the number 
of processors, would incur only about a 504 
increase in execution time. 

The PRT transform readily extends to the 
problem of evaluating the product of two square 
matrices as expressed in (4). It can be shown 
that the resulting systolic array for this 
problem is identical to that of Fig. 3a. The 
only difference occurs in the way A and B are 
input to the array. The PRT is applied to A 
which saves about n2/2 processors and the 
columns of B, input on the left side of the 
array are partially repeated as prescribed in 
(7). Due to the large number of connections 
which would be required to immediately unload 
this two dimensional array, the PRT configured 
processor will evaluate the matrix-matrix 
product without any time penalty compared with 
the original systolic array. 

Although they will not be discussed here, 


the PRT transform can be advantageously applied 


to some problems wnt nonsquare matrices are 
encountered. 


Quantitative Assessment of the PRT Transform 


The remainder of this section will be 
devoted to a quantitative comparison of the 
performance of the sytolic array processor 
proposed by Kung [1] (hereafter called original 
and denoted in certain instances by the 
subscript orig) with that of the PRT based 
structure (henceforth called alternate and 
denoted by subscript alt). The comparisons to 
be made will be based upon the following three 
figures of merit: 


(a) 
(b) 


Processor utilization efficiency 
Norig and alt- 
Space-Time product (ST) 
(ST) ait where 

S = number of inner product step 
processors 

T = number of algorithm time steps. 
Overall figure of merit F = 7/(ST), 


Q = Fait/Forig- 


orig and 


(c) 


In the comparisons which follow, no penalty or 
cost is assigned to implementing the PRT trans- 
form. Also it is assumed that n is large. 
First consider the matrix-vector problem 
which is shown for both processor configura- 
tions in Fig. 5. Adjacent to each processor 
configuration expressions for 7, S, and T are 
given. 7 is defined as the ratio of active 
area to the total area as shown in the figure. 
Simply stated it is an approximate measure of 
the proportion of algorithm time for which 
computations are performed. Only square 
matrices are considered here with bandwidth 
w= p+q- 1. Note also that the comparisons 
made here assume processor initialization as 
illustrated. 
Fig. 6 presents plots of as a function 
of the normalized bandwidth parameters y = p/n 
and x = q/n. This figure is drawn under the 
assumption that the array of the original 
configuration may be unloaded immediately after 
element apy has been processed. Alter- 
nately, Fig. 7 presents the same information 
except that immediate unloading of the original 
configuration is not allowed. The results show 
that the capability to immediately unload the 
array is important when x, y—~1.0. Note that 
the original configuration provides excellent 
efficiency for x and y both small, that is, for 
strongly banded matrices; however, as 
x, y—*1l.0 the alternate form is superior. 
Now consider a comparison on the basis of 
(ST) product. Solving the relation 
(ST)orig = (ST)ait provides the 
result plotted in Fig. 8. When the pair (7) 
lie above the curve, the alternate configura- 
tion provides a smaller (ST) product. | 
Generally it will be desirable to maximize 
the quantity F = 7/(ST) for a given problem. 
Therefore, Fig. 9 shows a plot of 
Q= Fait/Forig versus y with x a 
parameter. Given x and y for a particular 
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problem these results clearly indicate the 
preferred processor configuration. ~ 

Attention is now directed to the matrix — 
multiplication problem where it is required to 
evaluate C = AB when both A and B are nxn 
dense matrices. For the sake of simplicity, 
the general case of banded matrices will not be 
treated in this comparison. Three systolic — 
array CORE oUSReAORe will be considered. 


(a) <A PRT-based orthogonal ly-connected 
_ processor 
(b) The orthogonal ly-connected BROCE RAOE 
shown in Fig. 3(a). 
(c) The hex-connected processor presented 


in Fig. 3(b). 


The quantities of interest for comparing these 
three configurations (subsequently referred to. 
as configuration (a), (b) and (c)) are 
tabulated in Table 2. (Note in Table 2 that 
the double subscript on Q is interpreted to 
mean Qap = F3/Fp where a and b refer 

to the configurations listed above). From 
these results the PRI-based systolic array is 
seen to offer significant performance 
advantages with respect to configurations (b) 
and (c) under the conditions specified. 


5.0 Applications of Systolic Arrays to Matrix 
Inversion 


This section will consider both explicit 
and implicit methods for solving a given con- 
sistent set of linear equations. By explicit 
it is meant that the inverse matrix is made 
available to the user while implicit is used to 
imply that only the solution vector is deter- 
mined and made available. 

The hexagonally connected systolic array 
mentioned earlier can be used to explicitly 
invert a given symmetric, positive-definite 
matrix. The approach is discussed by Speiser 
and Whitehouse [3] and can be summarized as 
follows. First the L-U decomposition of the 
given matrix is formed using the hex-connected 
systolic array. Then using n appropriately 
interconnected triangle equation solvers, 

L7! can be computed. In this step the 

input to the array of triangle equation 
solvers, i.e. the known input vectors taken 
collectively, forms the, identity matrix. 

u-l is computed in a similar manner, and 
finally the inverse matrix is obtained by 
taking the matrix product u-ip-l, 

All of these steps can be implemented using 
systolic arrays. 

Implicit matrix inversion can be per formed 
in several ways, the most direct consisting of 
L-U decomposition followed by two executions 


using a triangle equation solver. That is, | 
given | 
Ax = b: A and b known 
LUx= b: LU decomposition step 
Ly = b: solve for y using triangle equal ton: 
solver. 
Ux = y solve for x using triangle equation 


solver 


This method, while it does not explicitly pro- 
vide A~! is generally more accurate than 

the explicit method which computes 

x = A-lp = u-l L-lb [4]. Other 

implicit techniques such as Jacobi's method, 
Gauss- Seidel's method and the successive 
overrelaxation (SOR) method [5] can be realized 
with systolic arrays. Implementation of 
Gauss~-Seidel's method is interesting because it 
is closely related to the PRT transform. 
Consider the equation Ax = b. Factoring A into 
the form A = D(L + I + U) where L and U are 
strictly lower and upper triangular matrices 
respectively (i.e., their main diagonal 
elements are zero) and D is a diagonal matrix 
D = diag (aj3), aj3 #0, i= 1, 2, 

-++, M. Jacobi's method of iteration can be 
written in terms of these definitions as 
follows 


ee neg ee (8) 
where Ly and Uj; denote the ith rows of L 

and U respectively. Implementation of (8) 
using either the original or alternate forms 
for systolic array matrix-vector multiplication 
is straightforward, only requiring insertion of 
zeros along the main diagonal and evaluation of 
the terms bj/ajj outside the array as an 
auxiliary computation. The equations defining 
Gauss-Seidel's method are as follows [5] 


ie - i) + b./a... 
i i i ai 
Le 2s enbg ns (9) 


Here the notation is identical to that in (8) 
except that in the term Lyxktl , 

xk+l represents only a partially filled 

vector (xktl, xktl, Rig 

xktt, 0, ...) which is "built 

up'' as the computation proceeds. Gauss- 
Seidel's iteration can be implemented in 
systolic array form by using the PRT transform. 
This is illustrated in Fig. 10 which shows that 
the diagonal elements have, been omitted and the 
terms bj/ajj are evaluated outside the 

array. Assuming that the computation is 
started with an initial estimate xk, it can 

be observed from Fig. 10 that xktl 

will be output and available for processing by 
the strictly subdiagonal elements L. (For a 
detailed example of this property see Fig. 4 
and note that in the present case 

xktl = y,, is output at time step 5. 

Note also that this value of yj is required 

in time step 6 for processing by a9], 

which in the present case is L7). Since U 
always processes a backdated estimate, it can 
be seen that the PRT transform, or some 
equivalent method, must be applied in order to 
realize Gauss-Seidel's method using systolic 
arrays. That is, unless the elements of L can 
be moved to the input side of the array where 
the x¥tl are input, the pipelining 

effect of the array prohibits implementing 
Gauss-Seidel's method. Therefore, the original 
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considered. 


form of the systolic array cannot, without 
modification, be used to implement Gauss- 
Seidel's iterative method. 

Note from Fig. 10 that Gauss-Seidel's 
implementation can provide extremely efficient 
utilization of processor capability. Processor 
utilization efficiency, starting at 83%, mono- 
tically increases toward 100% as the number of 
iterations increase. Although not discussed 
earlier when matrix-vector processors were con- 
sidered, a form similar to that shown in Fig. 
10 can be obtained for the problem y = Ax where 
Ais nxmwithn>m. For this case, input 
vector x is simply repeated the required number 
of times while the PRT transform is applied to 
successive m x m partitions of A. 

The SOR method of solution by iteration is 
very similar to Gauss-Seidel's method, the most 
important distinction being that the systolic 
array in this case computes the residual error 
which is then weighed by a relaxation parameter 
appropriately chosen to accelerate convergence 


[5]. 


6.0 Concurrent Testing of Systolic Array 


Processors 


Utilization of any functional device in 
realizing important system features ultimately 
leads to questions regarding reliability and 
maintainability properties. In this section 
interesting methods for externally testing 
systolic arrays for proper operation will be 
It is not practical to consider 
reliability features here; therefore, only 
issues related to maintainability, namely 
testability, will be considered. Only external 
methods for testing will be explored. 

Consider the systolic array for performing 
a matrix-vector product originally proposed by 
Kung [1]. Given the way in which the matrix 
rows pass through the processor array, a rather 
simple external test for proper operation of 
the array would be to augment the given matrix 
by adding two check rows--one at the top and 
another at the bottom. This is illustrated in 
Fig. 11 where the two additional rows must be 
identical in order to facilitate the check. 
Note from Fig. 11 that if no xj = 0 and no 
augmentation element is zero, each processor 
will be checked in the process of performing 
the matrix-vector product. The test is very 
simple since it requires only that y, be 
compared for equality with y,49. 

Two additional processors are required to 
realize this test. It is interesting to 
examine the cost required to implement this 
check in terms of added hardware and algorithm 
execution time. Let S represent the hardware 
required to realize a processor in the array 
and t denote the time interval required for 
each shift in passing the matrix through the 
processor. For an n x n dense matrix and using 
the product S (computation time) as a measure 
of resources used, then the efficiency is given 
by: | 
(S @ 2nt) without test 
[S€2n+2)t] with test 


1 - 2/n 


For n large, it follows that this is a very 
efficient test in terms of required resources. 

With respect to test effectiveness, 
however, questions follow with regard to fault 
coverage. If x is known to be dense and the 
augmentation does not use zero elements, the 
test will be good for detecting hard failures. 
However, transient failures represent a problem 
for this approach. 

The test method just described. can be 
applied to matrix-matrix processors, although 
comparison of more quantities must be made. It 
also follows that this approach is applicable 
to the PRT transform. Note for this case from 
Fig. 11, however, that for about n time steps 
no checks on the computation are performed. 
This can be overcome by additional augmenta- 
tions, appropriately interspersed, in the 
original matrix. 


7.0 Conclusion 


Systolic arrays represent a potentially 
‘important means for implementing computations 
involving large-scale matrices. The realiza~ 
tion of a general matrix-oriented computing 
capability that is founded upon a few standard 
modules using VLSI technology is appealing. 
However, as emphasized in [2, Sec. 8.2], 
minimization of wiring requirements (communica- 
tion costs) is a central problem in this 
technology. The PRT transform introduced in 
this paper can significantly reduce these costs 


for some problems. Of particular importance is 


the fact that these savings can be realized in 
some cases without increasing algorithm time. 
It has been shown that for n x n banded 
matrices the PRT-based systolic array and that 
originally proposed by Kung [1] are complimen- 
‘tary in the sense that when one is efficient, 
the other form tends toward lower efficiency. 
The PRT transform does not alter the original 
systolic array hardware definition. The 
time-ordered outputs are invariant under this 
transform - the only changes appearing in the 
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Fig. 1. Two Types of Inner Product Step 
Processors: (a) Type 1, (b) Type 2. 


order of accumulation of intermediate values 
before they are output at the array port(s). 

Solution of linear, simultaneous equations 
by iteration methods using systolic arrays 
results in an interesting interpretation of the 
PRT transform. The PRT or some equivalent 
transform appears necessary in order to apply 
systolic arrays to Gauss~Seidel's method or to 
the SOR method. | os 

A simple, efficient - though somewhat 
limited - testing technique was introduced for 
performing external concurrent tests on 
systolic arrays. This topic, as well as the 
others considered in this paper, is worthy of 
further study. 
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Fig. 4. Detailed Example of PRT Transform and 
Linearly Connected Systolic Array to 


Compute y = Ax. 
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MATRIX C 
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Fig. 3. Systolic Arrays to Compute Matrix 
Product C = AB: (a) Orthogonally 
Connected, (b) Hexagonally Connected. 
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ORIGINAL FORM 
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unloading 


= 2n, with immediate unloading 
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Fig. 7. Processor Utilization Efficiency 
Without Immediate Unloading. 
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Fig. 


Fig. Ll. 


ouTPUT 


Three Iterations of Gauss-Seidel's 
Method with Initial Estimate x* 
(External Computations Performed in 
Block Z). 
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A 


(AuemenreD Matrix) 


THESE ROWS AUGMENT INPUT MATRIX 


Test IMPLEMENTATION: COMPARE ¥y WITH Yago FOR EQUALITY 


Concurrent Testing of Systolic Array 
Matrix~-Vector Processor by Augmentation 
Method. 
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Table l. 
Algorithm Execution Time Requirements 
for Some Matrix Problems. 


Systolic No. of 
Array Problem Processors Algorithm 
Configuration Solved Required Time 
Linearly Matrix- w 2n+w 
Connected Vector 
Array Mult. 
Linearly Sol. of W 2ntw 
Connected Triangular 
Array System 
Orthogonally Matrix- nM 3n+M 
Connected Matrix 
Array Mult. 
Hexagonally Matrix- WAWRB 3n+M 
Connected Matrix 
Array Mult. 
Modified L~U De- pq 3n+m 
Hexagonal ly composition 
Connected A = LU 
Array 
Note: (a) Matrices are assumed n x n with 
bandwidths w = ptq-l. Subscripted 
w denotes bandwidth of indicated 
matrix. 
(b) Matrix-Matrix Multiplication either 
C = AB or C' = B'A'. 
(c) M = min(wa,wp), m = min(p,q). 

Table 2. Comparison of Systolic Array 
Configurations for Matrix-Matrix 
Multiplication (all matrices n x n). 

Quantity of Configuration 

Interest (a) _ (b) (c) 
T 5n. 5n. 4n 
S n2 2n2 4n2 
7 2)3 1/2 1/8 

Qab & 207 

Qac = 17 

Note: Tables 1 and 2 do not reflect processor 


inactivity (see [2, Sec. 8.3]). If 
these effects are considered and the 
lower bound on S is used for configura- 
tion (c) (see [2, p.305]) one obtains: 
Qab ~ 2.7 and Qan = 8. 


Summary of Systolic Array Hardware and 


PARALLEL PROCESSING OF 


THE KALMAN FILTER 


Angus Andrews 

Science Center 
Rockwell International Corporation 
Thousand Oaks, California 91360 


Abstract — This paper presents a pipelined 
mechanization for a multiprocessor system to 
implement the Kalman filter equations for adding 
process noise and updating the estimates after 
observations. These parallel algorithms use the 
UDL decomposition of the covariance matrix of 
estimation uncertainty. Parallelism can be used 
within the pipeline to further increase processing 
speed. The fastest method for the n-state filter 
requires O(log n) add-times per scalar update. 
Serial methods require 0(n2) multiply-add-times 
per update. 


Introduction 
The Kalman filter is an optimal linear esti- 
mator that was introduced by R.E. Kalman in 
1960 [1]. It provides a real-time mechanization 


for estimating the n-dimensional state vector x 
of a discrete-time linear gaussian system 


Xe = OXp_y + Guy (1) 
given m-dimensional observation vectors 

Zp = HEXp + Vy (2) 
and the covariance matrices QV: Ry of the gaussian 


processes fu,}, {v,}. The estimate x is updated 
by the following well known formulas: 


Xp t= Xy + K (zp - HX) ) (3) 


K, t= P H) (H pH! 


-1 
Ki PHL CHL PLHY + Rp) (4) 


Pp t= (I= KH, )Py (5) 

Xe = SpagXy | (6) 
7 T T 

Pray = Mea PE Peay * GQ Gy (7) 


where Py is the covariance of estimation uncer- 
tainty. * Update of the estimate following an 
observation is mechanized by Eqs. (3-5), and 
propagation of the estimate in time is mechanized 
by Eqs. (6-7). Although these equations describe 
the theoretically optimum linear estimator, they 
are not necessarily well suited to numerical 
implementation in finite precision. Schlee [2] 
and others have observed numerical instability of 
these equations, and much of the subsequent work 
by Joseph [3], Schmidt [4], and others has been 
directed toward more accurate and efficient 
algorithms for mechanizing these equations. 
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Potter [5] introduced the idea of using a 
square root of the covariance matrix in the 
algorithmic implementation. This is a matrix 


L 
5 =P? such that p-=ss' . (8) 


The advances in "square-root" filtering up to 
1971 have been summarized by Kaminski, et aZ. [6]. 
Subsequently, Agee and Turner [7], Carlson [8], 
and Bierman [9] have introduced strictly algo- 
rithmic approaches to square-root filtering. This 
paper shows how a variation of the Bierman algo- 
rithm can be implemented in a pipeline architec- 
ture, and how the same architecture can be used 
for adding the "process noise" covariance Q. of 
Eq. (7), using the Agee-Turner algorithm. 


Observation Update 


The Kalman filter is a recursive algorithm 
for updating the estimate. Equations (3-7) must 
be implemented at each recursion step. Agee and 
Turner introduced the idea of using a recursive 
algorithm for implementing Eqs. (3-7) as well. 
They also introduced the idea of using a UDL (or 
LDU) decomposition of the covariance matrix in 
place of the square root decomposition. This is 
a decomposition of the sort 


p = upu! (9) 


where D is a diagonal matrix and U is an upper 
triangular matrix with I's along its main 
diagonal. This factorization does not require 
taking scalar square roots. Bierman derived a 
recursive algorithm for implementing Eqs. (3-5) 
in terms of U and D, rather than P. It assumes 
that the covariance matrix R of measurement 
uncertainty is a diagonal matrix. This form can 
always be obtained from the UDL decomposition 

of R. If 


(10) 
is such a decomposition, then for T=T7? the 


alternate observation z:= Tz with alternate 
observation sensitivity matrix H:= TH has a 


_diagonal covariance matrix of observation uncer- 


tainty R:= A. When R is diagonal, each component 
of z can be processed serially as an independent 
scalar observation. It is this feature that 
allows the update equations to be pipelined. 


The following is a variation of the Bierman 
algorithm. It performs the update mechanization 
on U, D, and x, given z, H, and R. » 


for j:=1] step 1 until m do 


begin 
y?=z(j)3 (11) 
wi=R(Jod)3 
for k:=1 step 1 until n do 
begin 
s?=H(j,k); 
y =y-s*x(k)5 (12) 
for i:=] step 1 until k-1 do 
s:=stU(i,k)*H(j,7); (13) 
d:=s*D(k,k)3 (14) 
K'(k,J) t=d (15) 
a :=W; 
w=wts*d; (16) 
c:=-s/a; (17) 
D(k,k) :=D(k,k)*a/ws (18) 
for i1:=] step 1 until k-1 do 
begin 
u:=U(i,k); 
U(i,k) s=utc*K' (i,j); (19) 
K'(i,j) =K'(i,j)+u*d (20) 
end 
end; 
yy =w/y3 (21) 
for 1:=1 step 1 until n do 
x(i) s=x(i)t+y*K' (i,j) (22) 
end; 


The reader is referred to Bierman's book for 
a proof of the validity of the algorithm. 
Bierman's algorithm performs Eqs. (12-13) in 
separate loops. The modification nests the 
do-loops so that all computations sweep from left 
to right across the columns of U and D. These 
computations involve only one column of U at a 
time, and this feature allows one to pipeline the 


While the pth 
updated with information from the qth 
R, H, and z, the (p-1) th 
updated with information from the (q-1)th rows 


of R, H, and z. Simultaneously, the (p-2) th 
column of U can be updated with information from 


the (q-2) th rows of R, H, and z, and so forth. 
This data flow is illustrated by Figure 1, which 
shows how the different data from R, H, z, U, D, 
and x comes together for arithmetic operations in 
the above algorithm. 


process. column of U is being 


rows of 
column of U can be 


Figure | is not meant to imply a particular 
multiprocessor architecture, but merely to 
illustrate the relative data flow as viewed from 
the matrix U. The diagonal of the matrix D is 
shown replacing the main diagonal of U. During 
the arithmetic processing, the transposed rows of 
H flow through successive columns of U from left 
to right, along with the partially computed 
columns of the unweighted Kalman gain matrix K'. 
The associated scaling coefficients w are computed 
in the flow down the diagonal. By adjoining the 
estimated state X and scalar observation z on the 
right of the U-D matrix, one can perform the fol- 
lowing equivalent form of Eq. (3) in the last 
column: . 


X t= x+K'D (z- Hx) | (23) 


and Wys Wor °o%s Wh is the order of arrival of 


weighting coefficients in the last (x) column. 


The data flow in Figure 1 suggests a pipe- 
line mechanization in which the data in the 
columns of U, D, and x remain within the pipeline 
and the rows of R, H, and z flow through the 
pipeline. This arrangement seems a practical 
one, because U, D, and x are the "permanent" 
variables of the Kalman filter and R, H, and z 
are only temporary data. Also, U and D are 
usually considered "nuisance" parameters in that 
they are necessary for the processing but other- 
wise Of no interest. Therefore, they can remain 
within the pipeline without need of access from 
outside. The estimate x is available at the end 
of the pipeline. 


In this pipeline mechanization, the essen- 
tial pipeline processor element is associated 
with a column of the U-D matrix or x. The arith- 
metic processes involving a column of Figure 1 
must be completed before the results are avail- ° 
able for the next column. The order of execution 
of arithmetic computations involving a row of H 
and a column of the U-D matrix or x is given in 
Table 1. This mechanization could be implemented 
by using one processor for each square in Fig- 
ure 1, much like the "systolic" processor arrays 
of H.T. Kung [10]. The required data flow 
between processors in each column is shown by the 
vertical arrows in Figure 1, Such a mechaniza- 
tion would not make efficient use of the proc- 
essors, however. It would require n process- 
times to implement Eq. (12), and n process-times 
for the w-coefficient in Eq. (22) to make its 
way to x(1). Therefore, it would require O(n) 
processors to perform O(n) arithmetic operations 
in O(n) time, and the ree processor utiliza- 
tion factor would be O(n 


The arithmetic processor utilization can be 
improved to 0(1/log n) and the update execution 
time shortened to O(log n) by using an array 
processor to form the dot products of Eqs. (11-13). 
All multiplies can be done simultaneously, and 
n binary adds can be performed in logon add-times. 


Alt other arithmetic operations do not depend 
upon n. The fastest serial methods for the 
observation update require 0(n*) multiplies and 
adds, for either the square-root [9] or conven- 
tional [3] methods. 


R33,.Re2.R11 
Hmi,..., H3a1,He1,Hii 


H32,H2e.Hi2 


HmM3,..., H33,H23.Hi3 
HMN, 2. es H3n,. Hen, Hin 
2M ,..-, 23.22 ,41 
Figure 1. Relative Data 
Table I. Execution Order 


Equation Numbers 


| columns 1—n | Column (nt1) | 


| Order of 
| Execution 


Filling the pipeline requires O(log j) arith- 
metic operations for computations in the jt 
column. Filling n columns then requires 


(25) 


n 
0 (3 log 7 = O(n logn) 
j=l 


arithmetic operation times. Therefore, when one 
adds the time required for filling and clearing 
the pipeline, processing m independent scalar 
observations requires 0((ntm)log n) arithmetic 
operation times, 


Adding Process Noise 


The variate u, of Eq. (1) is called "process 


noise." The Kalman filter uses the covariance 
matrix Q of process noise in Eq. (7). Following 
an argument similar to that used for R, one can 
assume that Q is a diagonal matrix. (If it were 
not, then the factorization of Eq. (10) would 
lead to an equivalent formulation with Q:= A 
diagonal and G:= GT.) In that case, 


Flow in Observation Update 
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T - T 
aga = 2 4,9;9; (26) 
i 


where q. is the jth diagonal element of Q, oF is 


the yeh column of G, and p is the column dimen- 
sion of G. Therefore, it suffices to be able to 
perform the operation 

Pi= P+qgg! (27) 


in terms of U and D. This is done by the follow- 
ing algorithm, due to Agee and Turner [7]: 


for j:=1 step 1 until p do 


begin 
e:=Q(j.35)s 
for k:=n step -1 until 1 do 
begin 
d:=D(k,k); 
D(k,k) s=dte*G(k,j)+t2; (28) 
for i:=k-1 step -1 until 1 do 
begin 
G(i.j):=G(i,J) 
-G(k,j)*U(i,k)s (29) 
U(i,k) :=U(i,k) 
te*G(i,5)*G(k,j)/D(ksk) (30) 
end; 
e:=e*d/D(k,k) (31) 
end 
end; 


The data flow for this algorithm is illustrated 

by Figure 2, which shows how the data from U, D, 
G, and Q come together for arithmetic operations. 
The arithmetic operations of Eqs. (29-30) are to 


GiJ 


—~G3J 


GNI 


QsI 


Relative Data Flow for 
Adding Process Noise 


Figure 2. 


be performed in the off-diagonal rectangles, and 
those of Eqs. (28,31) in the diagonal rectangles. 
This process allows the same type of pipeline 
structure as the update, but with the direction 
of data flow reversed. 


In this case, there are O(n) arithmetic 
operations required in the longest column of 
Figure 2, but they can be performed almost in 
parallel. The exception is that the factors 
G(k,j) and e*G(k,j)/D(k,k) must be computed in 
the diagonal squares before processing can start 
in the off-diagonal squares. Assuming that these 
data can be broadcast to the off-diagonal squares, 
the processing in the longest column could be 
completed in 0(1) arithmetic operation times. 
Therefore, the minimum time required to add 
process noise covariance and clear the pipeline 
is O(ntp) operation times. 


Remarks 


One can show that it requires 0(n?/log n) 
arithmetic processor elements (APE) to attain the 
ultimate observation update speed in a multi- 
processor system. Updating the j&" column of the 


U and D matrices requires 0(j) adds and multiplies. 


Therefore, 0(n*) adds and multiplies are required 
throughout the pipeline for updating the n col- 
umns, and performing these in O(log 1) operation 
times (the ultimate speed) requires 0(n?/log n) 
APEs. 


The time update is probably the greatest 
computational problem for general Kaiman filter. 
The state dynamical equations (6-7) contain terms 
for deterministic (6) and nondeterministic (Q) 
dynamics. This paper has considered only the 
latter. The several serial solution methods which 
have been proposed for the triangular square root 
time update are mostly borrowed from square root 
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least squares methods, such as Cholesky decomposi- 
tion [11], modified Gram-Schmidt orthogonaliza- 
tion [4,12], and Householder [13] and Givens [14] 
transformations. Unfortunately, these presuppose 
formation of the product ®U, which requires 0(n°) 
arithmetic operations. The square root algo- 
rithms of Morf and Kailath [15] satisfy 

Eqs. (4,5,7) simultaneously, and only presuppose 
the products HL, OL for 


Lg = ee a) (32) 


It is often the case that Ly can be maintained 


with column dimension a << n, which would reduce 
the number of arithmetic operations to 
O((mtn)na) prior to triangularization. (These 
methods require that ©, H, R, G, and Q be con- 
stant, however.) Therefore, the fastest known 
parallel update with 0(n*/log n) APEs would be 
expected to require at least O(a log n) arith- 
metic operation times for the time and observa-~ 
tion update. 
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ON THE REARRANGEABILITY OF A (2log N-1) STAGE 
PERMUTATION NETWORK* 


Kyungsook Yoon Lee 
Department of Computer Science 
University of Illinois at Urbana-Champaign 


Urbana, 
Abstract 


Both the rearrangeability proof and the 
control algorithm are well known for the Benes 
network which is intrinsically symmetric. How- 
ever, there has been little progress for the case 
of nonsymmetric networks of similar hardware 
requirement. 

We provide a proof on the rearrangeability of a 
(2log N-1) stage network. Our proof does not 
depend on the symmetry of the network and can be 
applied to nonsymmetric as well as symmetric 
networks. We develop a global approach, one 
advantage of which is that it leads naturally to 
the idea of the rearrangeability proof and a 
control algorithm. For ease of understanding and 


presentation, the reduced eae is chosen to show 
the proof method. The 0, *-passable permutations 


are first characterized and the bit control 
algorithm emerges as the 'natural' control 
algorithm for such permutations. By a simple 


reinterpretation into (eng of the 0," -passable 


condition, unique control algorithm for the re- 
duced Vag to transform an arbitrary permutation 


= : , | : 
into an oe -passable permutation is obtained. 


The hardware requirement of the reduced 
2,057 is (Nlog N-N+1) switches which is the lower 


bound for rearrangeable networks. Though our 
algorithm has the same time complexity of 
O(Nlog N) for single control and O0(2N) for multi-~ 
ple control as the looping algorithm of the Benes 
network, it is simpler because calculations of 
inverse mappings are not required and it is easier 
to understand because the switches are set stage 
by stage. 
1. Introduction 

An interconnection network is rearrangeable 
if its permitted states realize every assignment 
of input points to output points [Bene64]. Our 
primary concern here is N = 2” input/output inter- 
connection network used as a permutation 
network. 


The Benes binary network (we will simply 
call it the Benes network hereafter), a member of 
Clos' three-stage networks [Clos53], is a rear- 
rangeable permutation network with a well-defined 
control algorithm [OpTs71], and it requires near- 
optimum hardware [Waks68], [Joel68]. While the 
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Benes network was proven to be a rearrangeable 
network long ago [Bene64], very little has been 
known about the rearrangeability of other multi- 
stage networks of the same hardware complexity. 
The topological difference--symmetry in the Benes 
network and nonsymmetry in the multistage 
networks, symmetry meaning that the left half and 
the right half of a network are the mirror images 
of each other--may well be the main reason. The 
Slepian-Duguid theorem [Bene65], [Bene75] on 
which the rearrangeability of the Benes network 
is founded, inherently applies only to symmetric 
networks. 


In this paper, we provide a new method to 
understand and prove the rearrangeability or the 
universality [Sieg77] of (2log N-1) stage net- 
works. In this method the first (log N-1) stages 
(FH) and the last log N stages (LH) of the net- 
work are controlled by two different control 
algorithms. All the permutations realizable on 
LH are characterized in terms of residue classes 
regarding the input permutation as on ordered set. 
The LH is controlled by the usual destination tag 
method [Lawr76], [Pate79]. To transform an 
arbitrary permutation into a LH-passable permuta- 
tion, FH is controlled by a proper residue class 
partitioning. Though we shall use a symmetric 


network 2 in the proof for ease of under- 


standing and presentation, it is emphasized that 
the proof does not depend on the symmetry of 
the network. 


Definitions and notations are given in 
: aL ies ‘ : 
Section 2. The Se -passable condition is derived 


in Section 3. In Section 4, the omega network Se 


[Lawr76] with its last switching stage removed (we 
will call it the reduced 2) » is shown to be able 


e e -l 
to transform any permutation into an en —passable 


permutation... The reduced cathy control algorithm 


and more hardware redundancy are discussed in 
Section 5 and finally, in Section 6, the conclu- 
Sion and the extension are described. 


2. Notations and Definitions 


Def. A Complete Residue System modulo m, CRS 
(mod m), is a set of m intergers which con- 
tains exactly one representative of each 
residue class mod m. 


A Complete Residue Partition, CRP is a parti- 
tion of a CRS(mod 2m) into two CRS(mod m). 


(m-1) 


For a given CRS(mod 2m) there are 2 different 


CRP's,. 


; We consider a N input and N output (N= 2) 
interconnection network consisting of finite 
number of switching stages and fixed connections. 
In this paper, we study only the case where each 
switching stage is N/2 of (22) switching 
elements. We can represent a n-stage network V as 


y= CR clel (., cla Dyg@-Don 


where C denotes a fixed connection and E denotes 
a switching stage and the superscript i specifies 
the i-th stage. 


In particular, the inverse omega network [Lawr/76] 


fie can be represented as 
= EVE U... E 


where U is an unshuffle. 


(n-1),, 


We shall use i aes an ordered set of N in- 


L 
put numbers, (a Ags ay 1)?» which is the oe 
permutation of the seni stage E> An Ss is 
shown in Fig. 1. Throughout the paper we will be 


interested in the CRS properties of various par- 
titions of At. 


Def. 
Ty, sig sk n 
Ag yotayt 0<k<2 
i af (n-j) 
= U 
aay ACs_1) 2k 0<k<2 » L<j<n 


L 
“G-D, (24)? 


(See Fig. 2 for the pictorial representation of 


these partitions for n=3.) Thus fa, _losk<2 8-4} 
forms a eee ee of A* = {ap |O<k<2"}. By 
the definition, 
i i Ss 
= U <Q< <s<j 

AS yk F {Any (291H8) 10S ae er Rees 

= Da ReacUT eng 
4 fay ge ee: (2) 


ven 1),k 


By sedate has m), we mean that A consists 
of two elements x and y, congruent to p and q 
(mod m) respectively. 


The relation between i and a Citd) 
(i+1) 


By A 


is decid- 


ed by the effects of E and U. = A“ EU, 


we mean that a (tt) 

by E and then by U. 
i , 

{a5, E 

(q(t) | _ (itl) 

. ae 

PAC i) willow See 


is obtained from A permuted 
re AD. alee cates, 


i 
E}= fap, aaa , 


eal 
)= (ay, E, 


i 
then Batty 


i 
aon41 BF) > 


(n-1) 


and. so a },0<k<2 


(3) 


An example of this relation for n=3 is given in 
Fig. 3. 


thus, ig ASM) J atey, o<icn , 
i (i+1) (i+1) 
th = 
Se ey ae Gey apt) ee 


L<jsn, o<k < 2d) 


ZL2 


(Fig. 4 shows the relation for n=3.) So, 


i (its) Ss 
=U 4 <R< 

Ae AG ee cGaue Be 4))|Os2 ae oe (5) 

L<j<n, 0<K< 2d) O<s<j. 
In particular, 

i 0 
Ay ak AGH) sk 

Def. An i t sequence Ale (ay sa ° Sy ) is 

. An ane quen Ayeess8ry yy 


an on 1 pasaable eae ce a, =k, 


pa when An =A “(hs 


The -control algorithm is as follows: 

each switch setting in the i-th stage is con- 

trolled by the i-th bit of the upper input in 

binary representation. If the control bit is 0, | 
then the switch is set straight. If the control 

bit is 1, then the switch is set cross (Fig. 5). 

Refer to Fig. 6 for an example of 9%," -control 
algorithm for the destination permutation (7,2,4, 
1,3,6,0,5). 


3% o* -Passability 


In this section, we shall prove Theorem 1 on 


-1 
hy 


eee -l 
Lemma 1 describes a characteristic of the oe _ 


-~passability through Lemma 1 and Lemma 2. 


control algorithm. The 9" -control algorithm 


works by sorting bits--if certain input condi- 
tions are satisfied, the lower i bits of the in- 
put numners to the i-th switching stage E+ are in 
ascending order, and moreover, the (it1)-th bits 
of two numbers that share a switch are distinct. 
An example of Lemma 1 is shown in Fig. 6 for n=3. 


Lemma 1 
If A, x 18 a CRS (mod as), 1<j<n, O<k<2 “3) 
sien’ caf ! 
OP ea ik O<i<n, where gt is controlled 


by the 9," control algorithm, then AY tp Pt?" } 


(it+1) Pere 
(mod 2 ) for any AY 7 Heo ,O0<p<2” , 
1<i<n. | 
Proof By induction of i. | 
a ee ; ig | 
Suppose that Ay g=ta,qt2” }(mod 24 5 | 


for any AD Q CAP C441) ¢ HD) 


(n-i) ,q’. Lik (n-(i+t1)), p’ 
Then 9 (a- (it+1)-1) (n=C441)-1) . (oy) 


O<p<2 (tt?) | 


Let A 


° p<k<2 


3 


(n-2) 


Case 1 O0<k<2 


y(n-(4H1)-1) |< (n-2) 


» and so p<2>. 


ings at 4 a Proof By induction of i. 


°F 
isk | Sak) #2) Note that Af) = (47) a? } = 
(n-i-1) (n-i-1) 1,k ia ae , » (n-1) 
2 *p<2k, (2kt1) < 2 *(pt1) ae ace ct 
i i i {k,kt2°‘"°"7} = {ky kt+2‘" 7} (mod 2%). 
Cc bd > 
PRCE CRORES, Dir PO)? ae pe oe (i+1) 
i i nb (i+1) Suppose that A = Ldedt2 } 
A ok ={p,pt2°>}(mod 2 4 
1,2k? “1, (2k+1) ae (i+2) a (i+1) . Let 
at 4 2 (i#1) (mod 2 ) for any A CA : 
Thus, 7 (2K)! = (2ic+1) E Pp ¢ . P ° 
A Ok og then ot). p<ke2 BAD) ary, 
(since p<2> ). However, they must be dis- dak (n-i),p i eas 
‘ + i i af i ict i+ 
tinct, (mod ghana). since a eG ok O<p<2°. Now A, 7 145,24 (21) 3 {a y(n)’ 
ty ) + : 
i A? nd i+] itl (141 +1 
ve ate (2ket1)? = Ay, i Beg) 4g by -(o).-an as ) eA a oe and a‘ ody EA att ) where 
0 (i+2) wT) ppg (2) 
hoo Lee CRS (mod 2 ye “Henee,; Grecia) 
ED »k x yD (442) 2 is given by k ¢€ ee Since Z : ° 
eb L -(j = 
Gace.) pee 9 (n- 2) cy (n- (1+1)- ce ass +2 Fey, therefore re Gin} Cs 
ttc ‘card 2 si ° > 
9 (n-2) 2 (WF), and so TT wad oD rece) Jie 
(HL), a8 714 (n-(itl1),p) ’ 7 (n-2) n-(itl), (p42 ) 
aan k Corer )+1 re a5 (2K Fle ae Pewee 
(et) Hence, by the induction Besenceae. 
ea ee (ecnGs aos pee hie 2 (it2), and 
n-i- i n-1- ’ 
2 (a2 JS Ail) eZ f . +49 
Ee Sit) : = {prot ppatee Gog 5 {t ay. 
=(p=Z tl) 1, (242 )) 
i 1 cat Tt. i+] i+] 2g 
“rjak'? “2, (2k'41) “A@m-2), ee) and so a**!) ssl 20) and al o Dy” 
i at = oi (k+2 
Therefore, A ei ={p- 2° ,p} 
me 1,2k b(2k 41) 542" God 2°41), 
i 5 
nee ). ; ‘ Thus, A = {p,p+27>} (mod 2 (ttt), 
spies ft ae tisk Q.E.D 
Thus 3 ee a OR 4 
p (mod 9 (Et1), (eipek pte” aaa Theorem 1 " -passability: 
(i+1) er coe, - Bacteria OL 4.0) —.0 he 
2 *<p<2 ». Again A Lk AD (2k'41))> An input sequence A @qrAporres4y 1), is 
AS ea mt? and as tidus i! isa Ty “passable iff Ay 7 is a CRS(mod 24), 1<j<n and 
ie cas ° gtard) 
| (i4+2) O<k<2 oe 
CRS (mod 2 ds Proof 1 We use the Oye + eontrel ok: 
Se ae ve2 _ (n-1 n-1L; 
Se = 1{p pt tage 2 dy switch settings. Then Lemma 1, A Lek {a Boy. j 
n-1) in (n-1) a 
Since it can be shown in the same way as above ary fk see2 ee 0<k<2 (n- in Hence, ow 
2 1 i =i ai 
that A ic = {P»Pt2} (mod 2°} for any Oe Vian ere me ve Cw ~1). qo). ae ae 
O<p<2 as the induction basis, the lemma is set (BO ~1)’ nee 
PEOVEGs Q.E.D. a rts, o<k<g B-D) Thus apek, O<k<2" and the 


2 t r i at 4a 4 
sista States that for any input sequence output sequence is on -passable. Only if: By 
that is {1 “~passable, the lower i bits of the in- 


2 . f 2 : L 
put numbers are in ascending order and the (i+1)- Lemma 2, AL ik ={p,pt2>} (mod git for any AL aie C 
th bits of two numbers that share a switch are ; i ; 9 
distinct for all the intermediate stages i. This Ant) ee O<p<2” and l<i<xn-2. Consider A, 1,k ota, 
indicates together with Lemma 1 that the ot. 7 = 1 1 1 
' a Ags yard a Ay pas Gane (n-2) 
control ereor tm the ‘natural’ control ee 1) kt? * , (L4+2 “ 


algorithm for wy 1_passable permutations. 


" “2 
with & given by k € {2£,22+1}. Then oe (n-2) 
Lemma 2 Gs 1) 1 1 

ete Lsgita i~ (since k<2‘°°~"} and AY , CA , ADE 

If av=k, O<k<2” and peg Dy ches te ye ag” eT) igs 

: 1 
ae of a particular switch egal of E', {0,2} (mod oy; ad A G22) io Seay | 
(441) ot 1, (R42 ) 
={p,p+2" (mod 2 ) for any AG : 
1,k 1 en area, p’ L ~ £4 qa9ke 2 
; A = 41.142 -tnod.<2~ )s 

0<p<2t and 1<i<n. 1 (242 62-2) 5 


1 
Thus a, = O(mod 2), a = 1 (mod 2) 
x (k+2 (n-1)) 
0 : ee oO | 
and Al k is a CRS(mod 2). For j>l, AD ae 


(j=1) Gal) Sat ond) qk) 
ae , p(L) =kt2Q2 » O<k<2 ; 


Since 2 G-D-D vocpcgycn® GYD, gary 
(j-1) (j-1) 

so eo ere ah) 

Tt follows that A k 


> 


and a eel e 425 th (mod 2d ). 


for mie is a CRS(mod 23). 
Obed 
As an example of Theorem 1, consider a permu- 
tation, (3,4,5,6,7,8,9,10,11,12,13,14,15,0,1,2), 
which is an uniform shift of distance 3. It is 


07 ;-passable because 


(3,4,5,6,7,8,9,10, 11,12,13,14,15,0,1,2) 


| eerie 
a CRS(mod 8) a CRS(mod 8) 


: eee PCR ITOD! (f, MORORME ioener tina i RCC Ree Ma A fe NAVE Det aera 
both CRS(mod 4) both CRS(mod 4) 


[Seeeeeon | enreee | DMNP=s | PSR pl De oy Ceramics COMER! FE, (aterm 
all’ CRS(mod 2) 
The switch settings for the same permutation for 
n=3 is shown in Fig. 7. As another example of 
Theorem 1, a shuffle permutation AO Bo g9.2610,3. 
11,4,12,5,13,6,14,7,15) is not QF passable hee 
cause 


(0,8,1,9,2,10,3,11, 4,12,5,13,6,14,7,15) 


fee Ne eae a) Pet age ie i ne 
+ a CRS(mod 8) + a CRS(mod 8) 


Usually network passability has been defined 
in terms of the bit telations between the source 
tags and the destination tags--a tag meaning the 
binary representation of a number [Lawr/76], 
[Peas77], [YeLa80]. In Theorem 1 the Q-1-passable 
condition for an input permutation is given in 
terms of CRS, which is an easier tool to deal 
with. We can see easily that certain permutations, 
identity and uniform shift, are -1_passable while 
certain permutations shuffle unshuffle and bit- 
reversal, are not {0 Ae aegab le; Clearly, identity 
permutation satisfies the CRS properties of 
Theorem 1 and the uniform shift preserves the CRS 
properties; while shuffle, unshuffle and bit- 
reversal violate the CRS properties inherently. 

As another example of the application of Theorem 1], 

the percentage of O51 passable permutations among 

N! permutations can be easily calculated by 

Theorem 1 and the use of combinatorics. 

Def. The oC" passable . condition for an input 
permutation eC en ve D is that AY 


j,k 
is a CRS(mod a4), for ee ie o<k<25, 


3 


The oC passable condition which is defined over 


Ld 0 
the original input sequence A , can be transformed 
into a condition for intermediate input sequences 


A” in the following way. 


i 1 i i 
I et BR. = b OQ < g } ¢ A 
k { k, 2 as j 


224 


i a (n-i) 


db, = (nani) > OSE? 
” (cho 2 a 


Thus Bre: in addition to AC 
Ay 
‘i, 
(see Fig. 8). 


k 


Lemma 3 


i,k 


-partition of A’. However, in contrast to 


, l<i<n . 


"s, provide another 


5 i 
k? elements of B, are scattered all over A 


The Oye Sa aeaule condition is equivalent to 


the Sor that each B. is a CRS (mod ge 


(n-i) . 
O<k<2 ke <a 
Proof By (5), Ay =U {am (a1) LO<e<277 = 
. : 2 OO, CUk+%*2 ) 
Bre 6.2: 


‘ _ 
4. Rearrangeability of the Reduced M0" 


We show that any input permutation can be 


; =A. F F 
transformed into an {2 "-passable permutation via 
the reduced oN network. We only reverse the 


: ~l 
direction of input and output in te , and con- 


sider it as Q, with input AD Sad ourpue Ae has 


enables us to use all the relations derived in 
the previous section without any change. Now 


k 
in the relations 
1 i ae 
Doi 2? P(aretr) QE PK, 
i i (i+1) 


B 


U <k< 
Pe POR ke | Oe 


» Db 


n= 


1 4 e - 
B,'s are natural partitions for a as can be seen 


(i+1) i 
k, Gee 
' (8) 
(1) O<i<n , 


which follows from (7) and the unshuffle effects. 
Tf we consider the inverse of E as an wr control, 


then (8) leads us to the needed control for the 
reduced ae in virtue of Theorem 1 and Lemma 3. 


Let o denote a shuffle permutation [Ston71]. 


Def. An exchange permutation ES is defined by: 


(it+1) 


if Aa OE, and if every By is a 


CRS (mod gfith), | then every B. is 


CRS (mod oY: 


k 


Now {E,|0<i<n} is the one and only control we 


have been looking for by the following theorem. 


Theorem 2 


{E, | 0<isn} is the one and only kind of 


control to a aly an arbitrary permutation, 


a a ai fs 1 


: 0 0 
permutation A = (agsaporees 


wee 


0 


) into an 0+ passable 


ayy via the. reduced 


Proof 
n  n, n 
Since B. =A is always a CRS(mod 2), by the 
definition of Ee it follows that each Be 
o<k<2 9-1) | 1<i<n is a CRS(mod 2 dy when A i_,(i+1) 
i 0 0 0 0 

O ; = een i 

ER Hence, by Lemma 3, A (ap »ay > an) is 
an Oy Pies Bath tee Paar Conversely, any such 


pre must be ED 


R 
3 and (8). The fact that 0<i<n, not O<ix<n, 
Theorem 2, means that Qa, 


at stage i by Theorem 1, Lemma 
in 

1 ee ue ' 
-passable condition is 


satisfied already for Al, Thus, the switches of 
the last nee of we are redundant leaving us with 
the reduced Qe 


OxvE. Ds 
We now show that E. is always realizable by a 
set of switches. The relation (8) states that E, 
U 
should be aes partition BS. aki oo it is 
a ee 2 ) into two CRS (mod kag Oe ae and 
see by partitioning proper (b> etre, at 
each switch element. Thus the following Lemma 
ensures the realization of Ee 
Lemma 4 
pe CI) 
O<k<2 fa, 5b, | O<k< 


Consider (a, »by), 


oSt-1) 1 a cRs(mod 22), 
tion E such that fa, E 


Then there is a permuta- 
b, £}= fa,,b,} for each k 


€i.e., E is an exehange permutation) and such that 


ta, E o<k<2 64“) } is a CRS (mod g (4-1), | 
Proof 
3% — 1} 
By construction. Let a, = = p,, (mod 2 Dee 
by = q,, (mod am ts 0<k< gst) Every number 
(i=T) 


between O and (2 -1) occurs twice among p 

and q,'s Rearrange the pairs into groups in such 
a way that, in each group with more than one pair, 
any two adjacent pairs have a number in common and 
the first and the last pairs have a number in 
common. In such a group, if the common number of 
the lst and the 2nd pairs occurs in the same posi- 
tion of eacg pair, then exchange positions of the 
two numbers of the seconf pair. By repeating this 
process between the second and the third pairs and 
so on, we obtain (Pye dy)' Ss, such that (PL 2d) = 


{ppd ts o<k<2 4 -1) and all Pye s are distinct. 
e = t. 
Set (a,° E,b, E) ba if P= PK» 
(b, a,) otherwise. 
Then by construction, 
a, *E= p1 (mod gfi-1)y | o<k<2 FY) 


and (a, Elo<k<2‘F-}) 3 is a CRS(mod gfi-1), | 


Note that in the exchange processes in the above 
proof one pair of numbers does not need an 
exchange, indicating the redundancy of a switch in 


Oc 2B: 
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each CRP of E- 


R° An example of Lemma 4 is given 


in Fig. 9 and an example showing the transforma- 


tion of a non-S, 
on 


, ~passable permutation is shown in Fig. 10. 


a : ‘ 
-passable permutation into an. 


In conclusion, the concatenation of the 


reduced ey and ag called the reduced oa 


is 
a (2log N-1) stage rearrangeable network. 
Theorem 3 
The reduced 0,007 is a rearrangeable 
network. 
Proof 
By Theorem 1 and Theorem 2. Q.E.D. 


5. The Reduced 2% 0*—Control Algorithm 
and Redundancy 


In this section, we summarize the reduced 


20 *-control algorithm which has been described 
in the previous sections. We shall call a binary 
tree of CRP's a CRPT. A CRPT is a full binary 
tree of (n-1) levels whose root node is a CRP 
partitioning a CRS(mod 2") and the two sons of a 
node are two CRP's partitioning two CRS's pro- 


(i- 1), 


duced by the parent node. Observe that ER 


the previous section corresponds to the i level 
of a CRPT. 
The reduced 2 9+ control algorithm. 

Li bi reduced ot is ve by a CRPT. 

2. 


algorithm (Sec. 


is eal by the Quy ueoneeel 


2). 


16%16 
is shown in Fig. 11 for the bit reversal permu- 


tation. 


An example of the reduced 2 -control algorithm 


We discussed earlier, in the proof of Theorem 
2, the redundancy of the last switching stage of 
Oe by which we obtained the reduced 2... Further 
redundancy in switches was mentioned after Lemma 
4, 
The ia of the CRP's needed for the reduced 


Qa Nee Sebaenel algorithm is, 


No. of CRP's for the reduced St = 
a Oe 
Thus the number of necessary switches in the 
reduced wt we =A is, 


No. switches in the reduced I" 
(2n—-1) *nw/2 - (N/2-1) = (Nlog N-N+1) 
This is exactly the same as the number of switches 
required for the reduced Benes network [Waks63], 
[Joel68] which is a lower bound for rearrangeable 
networks. 


The control by. a CRPT requires O(Nlog N) 


time steps, and the oF *~control requires O(log N)_ 


time steps. Therefore, the time complexity of 


the reduced 2 0" -control algorithm is O(Nlog N). 


‘When more hardware is available for control, the 
CRP's on the same level of a CRPT can be done in 
parallel reducing the CRPT control time and thus 
the overall control time to O(N+N/2+...+2)=0(2N). 
These control time complexities are in the same 
order of magnitude as those of the looping 
algorithms for the Benes network [OpTs71]. But 
our algorithm is simpler as inverse mappings are 
not needed, and it is easier to understand as the 
switches are set stage by stage. 


A characteristic of a CRPT control is that 
any fixed connection at the beginning of a re- 
arrangeable network is redundant. Therefore, the 


first shuffle of the wadueed on ay can be removed 


and the resultant network is still rearrangeable. 


6. Conclusion and Extension 


Employing a new global approach a proof of 
the rearrangeability of the reduced 29 was 


given. The proof method can be used for non- 
symmetric as well as symmetric networks. We have 
succeeded in constructing a rearrangeable network 
with the same hardware requirement as the reduced 
Benes network. A control algorithm with the same 
control time complexity as that of Opferman and 
Tsao-Wu's looping algorithm was described. Our 
control algorithm is simpler because calculations 
of the inverse mappings are not required, and it 
is easier to understand because the switches are 
set stage by stage. 


We have considered only the N=2" case with 
(2x2) switches in this paper. An immediate gener- 
alization may be to the N=p"™ case with (PxP) 
switches for p>2 [Lee ]. Another extension is 
being investigated on the proof of the rearrange- 
ability for the networks resuiting from concate- 
nation of two delta networks [Pate79]| based on 
the CRPT related equivalence relation [Lee ce 


We could have used the known network equiva- 
lence relations to obtain many rearrangeable 
networks with the minimum hardware requirement 
[Lee ]. Some networks surely will be nonsym- 


metric (for ex, BQ“), One of the networks that 
can be obtained in this way is the reduced Qo”. 
Parker showed that Qo is rearrangeable [Park80] 
without giving a control algorithm. Now we have a 
control algorithm for the (2log N-1) stage reduced 
@o2 where the reduced % is controlled by a 
modified CRPT and the second {% is controlled by a 
single bit control algorithm. 


Even the Benes network can be thought of as 
another (2log N-1) multistage network, and if we 
use a similar approach, we get a new Benes network 
control algorithm [Lee }. Thus, the notion of 


symmetry and recursiveness which differentiated 


- the Benes network from other multistage networks 


\ 
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disappears and they can be treated as a family of 
rearrangeable permutations networks with the 
minimum hardware requirement. 


Now the bottleneck of the control algorithm 
lies in the set partitioning problem. If we can 
find a fast set partitioning algorithm, the 
network control time can be greatly reduced. 
Also, the non-uniqueness of the CRP might be use- 
ful for fault-tolerant network designs. 
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Abstract 

Design issues for the multistage Generalized 
Cube network are discussed in this paper. An 
analysis of the merits of 2-input/2-output inter- 
change boxes versus 4-input/4-output crossbars for 
interconnection network implementation is made. 
The cost and performance of each network for the 
two switching node alternatives are examined. 
Discussion of the suitability of each approach for 
VLSI implementation is included. It is shown that 
in a packet switching environment, 4x4 crossbars 
outperform, and are less expensive to implement 
than the four interchange boxes they replace. 


I. INTRODUCTION 

The choice of interconnection network is a cen- 
tral issue in the design of large-scale, 
multimicroprocessor~based distributed and parallel 
systems, The Ballistic Missile Defense (BMD) 
Agency is designing a test bed for evaluating such 
systems as they may apply to BMD tasks [C8]. PASM 
is a multimicroprocessor system being designed at 
Purdue University for a variety of image process~ 
ing and pattern recognition problems [16]. In 
both cases a highly flexible network is needed for 
communication among processors and memories. 

The Generalized Cube network has a _ cube-type 
topology and is constructed from 2-input/2-output 
crossbars or interchange boxes [17]. A more’ gen- 
eral form of interchange box is an a-input/a-~ 
output (a x a) switching node. A relative of the 
Generalized Cube network can be constructed from 
a X a switching nodes using cube-type connections 
between stages. Many papers in the Literature 
discuss using larger than 2x2 interchange boxes 
for implementing multistage cube-type networks (2, 
7, 10, 11, 12, 18]. In the following, design op- 
tions for 4x4 switching nodes are considered. The 
performances of two designs are evaluated = and 
their implementation in discrete Logic (e.g., TTL) 
and VLSI is considered. It will be shown that a 
4x4 crossbar performs better and costs less than 
four 2x2 crossbars in a packet switching environ- 
ment. 

The logical structure of the Generalized Cube 
network is defined in Section II to provide a 
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framework for discussing modifications. In Sec- 
tion III, the performance of two network implemen- 
tations are compared. Implementation considera- 
tions are presented in Section IV. For further 
details of all this material see (14]. 


II. DEFINITIONS 

A partitionable SIMD/MIMD system is a parallet 
processing system which can be structured as one 
or more independent SIMD and/or MIMD machines [4] 
of varying sizes. PASM is a_epartitionable 
SIMD/MIMD system for image processing and pattern 
recognition [16J. The BMD testbed should have the 
flexibility to perform as a partitionable 
SIMD/MIMD machine. The cube network described 
here can function efficiently in such an environ= 
ment. 

The Generalized Cube network (Fig. 1) is a mul=- 
tistage cube-type network topology which was in- 
troduced in C17]. It has been shown that this to~ 
pology is equivalent to that used by the omega 
C7J, indirect binary n-cube [11], STARAN [1], and 
SW-banyan (F=S=2) [6] networks (17, 20]. An N 
input/output Generalized Cube topology has 
mn = LogoN stages, where each stage consists of a 


set of N Lines connected to N/2 interchange boxes. 
Each interchange box is a 2-input/2-output device. 
The labels of the input/output (1/0) Lines enter~ 
ing the upper and lower inputs of an interchange 
box are used as the Labels for the upper and tower 
outputs, respectively. The labels are the in- 
tegers from 0 to N-1. Each interchange box can be 
set to one of four states as shown in Fig. 1. The 
connections in this network are based on the cube 
interconnection functions (131. Stage 71 of the 
generalized cube topology pairs I/0 Lines that 
differ only in the i-th bit position. 

The name cube network will be used to refer to 
the network consisting of the Generalized Cube to~ 
pology and four-state interchange boxes. Each in- 
terchange box will be controlled independently 
through the use of routing tags [7, 15]. 


STRAIGHT 


EXCHANGE 


UPPER 
BROADCAST 


LOWER 
BROADCAST 


SA 


(a) (b) 


Figure 1(a): Generalized Cube topology for N=8. 
(b): Four states of an interchange box. 


It is assumed that processors and memories are 
paired to form processing elements (PE's). The 
network is configured such that PE i is connected 
to input i and output i, O<i<N. 
switching mode, in which packets move from stage 
to stage in the network as paths between stages 
become available, is assumed. They do not require 
that their entire path be established prior to 
entering the network. A packet consists of a 
routing tag and a number of data items. Packet 
switching in multistage networks has been dis 
cussed in (3, 19]. 

The primary goal here is to investigate the 
cost-effectiveness of constructing multistage cube 
networks from 4x4 crossbars versus 2x2 crossbars 
(interchange boxes). Since a single 2x2 inter- 
change box is not functionally comparable to a 4x4 
crossbar (i.e., it can only handle two items at a 
time instead of four), the 4x4 crossbar is com- 
pared with a 4x4 composition of four 2x2 inter- 
change boxes. This configuration is called a 
composite node and is shown in Fig. 2. A network 
constructed from properly connected (to be speci- 
fied later) composite nodes is identical to a cube 
network constructed from 2x2 interchange boxes. 
The external connections of the crossbar (Fig. 3) 
are identical to those of the composite node, so 
it can be directly substituted for a 4x4 composite 
node. 

Many options for the implementation of 2x2 in- 
terchange boxes were discussed in [9]. To avoid 
repetition, one of the configurations discussed in 


that paper will be assumed here. It is assumed 
that packet switching is implemented and that an 
entire packet is transferred between adjacent 
stages during one network clock cycle. Further- 


more, the size of each input queue in a switching 
nede is assumed to be an integral multiple of the 
packet size. The packet size is thus not res~ 
tricted to be any particular number of words. 


R-REQUEST G-GRANT 


A 4x4 composite node constructed from 
four 2x2 interchange boxes. 


Figure 2: 


The packet 
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III. PERFORMANCE ANALYSIS 

The 4x4 crossbar node and composite node will 
be compared in their performance at both a Local 
and global level. On a local level blocking 
within a node is examined. On the global Level, 
the permuting ability of two networks constructed 
from the respective 4x4 switching nodes is com 
pared. 

Consider the local level. 
composite 
nected to the inputs of the node and 4 
the two interchange boxes connected to the out- 
puts. The composite node can perform 16 permuta~ 
tion connections (each box either straight or ex- 
change) and the crossbar node can perform all 4! 
possible permutation connections. 

For those permutations where there is no con 
flict in either node, the messages traverse the 
composite node in twice the time required by those 
in the crossbar node due to the two levels of in- 
terchange boxes. When conflicts occur in the 
crossbar node, the delay due to waiting diminishes 
the speedup achieved. 

Consider situations where there are conflicts 
in a switch node. For this analysis it is assumed 
that the destination of any message is a uniformly 
distributed random variable. Also, it is assumed 
that each message has only one destination (i.e., 
no broadcasting). Both the composite node and the 
4x4 crossbar node have four inputs and four out- 


Let levei 1 of a 
node be the two interchange boxes con- 
level 2 be 


puts so there are 42256 distinct patterns in 
which messages may need to be routed through the 


boxes. Since the destinations are assumed to be 
random and uniformly distributed, the distinct 
data patterns of routing are all equally Likely. 


Assuming four simultaneous inputs is somewhat of a 
case, since in MIMD mode this would be con- 


io) 


weet 


CONTROL UNIT 


R=REQUEST G-GRANT 


Figure 3: A 4x4 crossbar node. 


sidered heavy loading and in SIMD mode destina- 


tions are not random but structured and chosen to 
avoid conflicts. The node is assumed initialiy 
empty. 

Consider the 4x4 crossbar node. Let r be the 


maximum number of messages desiring any given out- 
put of the 4x4 crossbar node. The total time re- 
quired for all four messages to pass through the 


node is r. P(r=1) = 24/256, PCr=2) = 180/256, 
P(r=3) = 48/256, and P(r=4) = 4/256. The expected 
time to pass all four messages through the 
crossbar node is given by: 

4 

D i(r=i) = 2.125 network clock cycles. 

{=1 
That is, given that four messages arrive at = an 


empty crossbar node simultaneously, on the average 
it will take 2.125 network clock cycles for the 
node to empty. 


Now consider the composite node. The following 


notation will be used in the ensuing equaticns, 
where i=1 or 2: 
PCiU) = P(no conflict level i, upper box) = 1/2; 
PCiL) = P(no conflict level i, Lower box) = 1/2; 
P(iX) = 1/2, where X = U or L3 and 
P(i) = P(no conflict in Level i) = 1/4. 
Now consider the probabilities of different 
amounts of time, t, to pass four input messages 


through the composite node. The minimum time pos 
sible is 2 network clock cycles because there are 
two levels. 

P(t=2) = P(1U) PCIL) PC2U) *PC2L) = 1/16. 

For a total time of 3 network clock cycles 
there are 5 cases to consider. First assume no 
conflicts occur in level 1. 

P(t=3, case 1) = P(1)°(-P(2)) = 3/16. 

Next, assume exactly one Level 1 
box has a conflict. P(t=3, case 2) = 
CC1-P (1U)) *P C1LI4+P CTU) 2° (1 -P (1L) D7 oP (2X) = 1/4. 

For case 3, there is one conflict at each 
el, but the maximum delay is 3 cycles. 

P(t=3, case 3) = CC1-P(1U)) PCILI+P CU) 9 (1 -P (1L) 7 

© (1 -P (2X)) ° (1/2) PX) = 1/16. 
The first factor is the probability that exactly 
one box at level 1 has a conflict. The next fac~ 
tor is the probability that the first message from 
the level 47 box which had a conflict, call this 
message M, also has a conflict at level 2. The 
(1/2) is the probability that M will be chosen to 
pass through the level 2 box first. The last fac- 
tor is the probability that the two delayed mes- 
sages do not conflict. 

Case 4 assumes that there is a conflict in both 
level 1 boxes and that both level 2 boxes receive 
messages (this happens half the time there are two 
conflicts in level 1). | 
P(t=3, case 4) = (1/2)°(A-PMU))°QA-PQL)) = 1/8. 

Finally, assume conflict in both level 1 boxes 
but only one level 2 box receives messages and 
there is no conflict for either pair that passes 
through: P(t=3), case 5) = 
(1/2) *(1-P (1U)) 9 (1 =P C1L)) *P (2X) *P (2X) = 1/32. 

The probability that all messages pass through the 
composite node in 3 network clock cycles is 
P(t=3) = 3/16 + 1/4 + 1/16 + 1/8 + 1/32 = 21/32. 

For a time of 4, there are four cases to con- 
sider. The first case is where there is one con- 
flict at each level. There are two ways to obtain 
a time of 4 from this situation: (1) the delayed 


interchange 


Ley- 
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message enters a non-empty queue in level 2 and 
(2) the delayed message enters an empty queue but 
conflicts with the other remaining message: 

P(t=4, case 1) = C(1-P(1U)) *PCIL)+P (10) °C -P ILD) 
°£ (1/2) *(1-P (2X) + 1/2) °C1-P (2X) 9 (1 =P (2X) =3/16. 

Now assume conflict in both level 1. boxes and 
that only one level 2 box receives messages (this 
happens half the time there are two conflicts in 
level 1). Given this occurs, there are three ways 
(cases 2, 3, and 4) a time of 4 occurs. In case 
2, the first two messages reaching the box in lev- 
el 2 conflict, but there are no subsequent con- 
flicts: 

P(t=4, case 2) = (1/2)°(1-P(1U)) °*(-P(1IL))° 
(1-P (2X)) °P (2X) = 1/32. 

In case 3, the first pair of 
conflict but the second pair do: 
P(t=4, case 3) = (1/2)°(-P(U))*G-PCIL))® 
P(2X) °(1-P (2X0) = 1/32. 

In case 4, the first and second pair of mes- 
sages conflict. When the second pair conflicts, 
one queue will contain two messages. For a time 
cf 4 the queue with two items must be selected to 
resolve the second conflict and a third conflict 
must not occur. 

P(t=4,case 4) = (1/2)*(-P(U)) °QA-P(L)) 
°(1-P (2X)) *(1-P (2X)) ° (1/2) 9P (2K) = 1/128. 

The probability of a time of 4 is: 

P(t=4) = 3/16 + 1/32 + 1/32 + 1/128 = 33/128. 

The time of 5 happens when either of the two 
conditions of case 4 for a time of 4 are not met. 
P(t=5) = (1/2) °C -P (1U)) °C -P (ILD) °C -P (2X)) 
°£(1/2) (1 -P (2X) +01 /2) C1 -P (2X) 1 -P (2X) = 3/128. 

The expected time for all four messages to pass 
through the composite node is: 

5 

>» i * P(t=i) = 
i=2 
This time is 53% Longer than the 2.125 network 
clock cycles expected with the crossbar node. 

Consider the global level. To construct a net~ 
work from m/2 stages of N/4 4x4 switching nodes, 
assume all connection Lines in the network are la- 
beled in base 4 and that the stages are numbered 
(m/2)-1,°°°,1,0 (from input to output). At stage 
i, the four input Lines to a node are those that 
differ only in the i-th position of their base 4 
representation. The line with a QO in the i-th po- 
sition connects to the top input, 2 to the next 
input, 1 to the next input, and 3 to the bottom 
input. The output Lines of the 4x4 switching 
nodes have the same Labels as the input Lines, but 
in increasing order, i.e., the top output line la- 
bel has a0 in the i-th position, next 1, next 2, 
and the bottom 3. When composite nodes are used, 
making connections in the above manner creates a 
cube network. When crossbars nodes are used, a 
network is created whose capabilities are a super- 
set of those of the cube network. 

A composite node network consists of Nm/2  in- 


terchange boxes, allowing oNm/2 permutations. ASs~ 
suming m is even, a 4x4 crossbar node network con- 


of Nm/8 nodes, permitting (4 4) Nn/8 permuta- 


If m is odd and one stage is constructed 
crossbar nodes Limited to act as a 2x2 
N/2 (4 yy NCm-1) 78 


messages do not 


3.242 network clock cycles. 


sists 
tions. 
by 4x4 


nodes, then 2 
sible. 


permutations are pos- 


IMPLEMENTATION 

the destination tags 
Let the destination ad- 
as dni °° da dy 


A switching node in stage i examines bits do s44 
and dos. 


interchange 


IV. 

To control the network, 
defined in 7] are used. 

dress D be represented in binary 


Level 
and the 


For the composite node, the first 


boxes examine only bit dos 44 


second Level interchange boxes examine only bit 


do.- If the bit examined is 0, the upper output 


Link of the interchange box is selected and if the 
bit is 1, the lower Link is selected. For the 
crossbar node, both bits are examined simultane- 
ously. Together they are considered a single base 
four digit which corresponds to one of the outputs 
Labeled 0 through 3. 

To add a broadcast capability, an mbit broad- 
cast mask is appended [15]. Let the mask B be 
represented in binary as or b A switching 


1°0° 
baie ¢ 
34 For the composite node, first level 


change boxes examine bits with index 2i+1 and 
second Level boxes examine bits with index 21. If 
the broadcast mask bit is 0, the destination tag 
bit is interpreted as before. If the mask bit is 
1, the destination bit is ignored and both output 
Links of the interchange box are selected. For 
the crossbar node the four bits are all examined 
simultaneously. They are interpreted so as to es- 
tablish the same connections as those that would 
be obtained in the composite node. Five kinds of 
broadcasts are defined for either type of 4x4 
Switching node. 


node in stage 7 now examines Dosa and 


oan 


eit 
inter- 


Hardware Without Broadcast Capability 
For simplicity, designs for the composite node 
and the crossbar node initially will be developed 
assuming no broadcast capability. Then, those 
portions of the designs affected by inclusion of a 


broadcast capability will be modified and com- 
pared. 
In the following analysis, hardware complexity 


measured in terms of logic gate count and chip 
count. The gate counts are used as a first ap- 
proximation to compare VLSI implementations. 
Designs using this technology must also consider 
wiring complexity [C5]. The chip counts are used 
to compare discrete logic (e.g. TTL) implementa- 
tions, assuming standard gate-per-chip packaging. 

Examining Figs. 2 and 3, the first difference 
noted is that the crossbar node requires half as 
many queues as the composite node. Depending on 
the actual queue size, a considerable savings in 
logic may be realized in the implementation of the 
crossbar node. To compare multiplexer require- 
ments, typical implementations of 2-to-1 and 
4-to-1 multiplexers were examined [14]. Eight 
2-to-1 multiplexers require 20% more gates (re- 
gardless of path width) than four 4-to-1 multi- 
plexers. The chip counts are equal. Since’ the 
number of external connections for data and con- 
trol Lines is the same for both designs, any 
buffering/signal conditioning logic will be com- 
parable. In a VLSI design, this implies identical 
pin counts. 

Thus far the crossbar node appears to be the 
better choice. It is however, decidedly more com- 


is 


Zo2 


made 


plicated to arbitrate the requests of four packets 
simultaneously (as opposed to two) while assuring 
each packet equal access to each output Link on 
the average. To determine whether one 4x4 control 
unit is actually more complex than four 2x2 con- 
trol units, the functional components of the con- 
trol units are considered. 

The control unit of a 2x2 interchange box con- 
tains two sets of queue control logic, input re- 
quest arbitration (IRA) logic, output request ar- 
bitration (ORA) Logic, and timing. The control 
unit for a 4x4 crossbar node contains four sets of 
queue control logic. The remaining components are 
the functional equivalents of those for the 2x2 
interchange box. The most obvious difference 
between the two designs is that four 2x2 control 
units contain twice as many sets of queue control 
Logic as one 4x4 control unit. 

One set of queue control logic 
gisters which store pointers, one 
one to the back of its associated 
queue is Q words Long, Log,@ bits 


contains two re- 
to the front and 
queue. If the 
are required for 


each register. 

The IRA logic is quite simple. If a request is 
for ‘the i-th input, (i=0,1 for the 2x2; 
i=0,1,2,3 for the 4x4), it will be granted if the 
i-th queue is not full. Once again, four 2x2 con- 


trol units require twice as much IRA Logic as_ one 
4x4 control unit. 

The timing logic is identical in both cases. 
Three clock phases are generated. A 
request/grant/transfer protocol is implemented 
(see (9]). 


None of the logic discussed thus far is affect- 
ed by the inclusion of a broadcast capability. 
Thus, its analysis is equally applicable to the 
next subsection, which includes broadcast capabil- 
ities. 

The most important and by far the most 
component of the 


complex 
control unit is the ORA Logic. 


It is responsible for examining the routing tag 
bits and generating signals to set the multi- 
plexers and make requests. It must also examine 


the grant signals and generate control signals for 
the “increment front pointer" input of each set of 
queue control logic. The complexity of this Logic 
arises from arbitrating conflicting requests for 
access to the output ports. 

To compare the ORA Logic, equations are derived 
for all its output signals as a function of the 
tag bits and grant signals (14]. The total (NAND) 
gate count for 4 sets 2x2 of control unit logic is 
104 gates. This corresponds to 24 chips. The 
control unit for the 4x4 crossbar node requires 
124 gates. There is a 19% increase in the number 
of gates required by the crossbar node. In a 
discrete logic design, the chip count is 32. This 
is a 33% increase over the 24 chips required in 
the composite node. 

The excess in ORA Logic can be compensated for, 
since a 4x4 crossbar node requires half the queue 
control and IRA Logic of a 4x4 composite node. 
From the equations derived, 20 extra gates or 
eight extra chips are required for the 4x4 
crossbar ORA Logic. Assuming one of the eight 
sets of queue control and IRA Logic in a composite 
node will require more than 5 gates or 2 chips, 
the 4x4 crossbar node is actually less expensive 
to build. Despite the higher wiring complexity of 


the 4x4 crossbar node, the total design effort is 
comparable to that required by the 4x4 composite 
node. 


Hardware With Broadcast Capability 

Adding a broadcast capability requires the ORA 
Logic to examine the broadcast mask bits in addi- 
tion to the routing tag bits. The revised equa- 
tions for the 2x2 control unit require 33 gates, 
which multiplied by 4 is 156. This is equivalent 
to 48 chips. A broadcasting capability costs 52 
gates or 24 chips beyond the requirements for a 
4x4 composite node without it. More details can 
be found in [14]. 

The circuitry needed to add the same broadcast 
capability to 4x4 crossbar nodes as was added to 
the composite nodes requires 233 gates, a 49% in- 
crease over the 156 required for the composite 
node. The chip count is 74, a 54% increase over 
48. In this case it is likely that one of the 
eight sets of queue control and IRA Logic will re- 
quire more than 20 gates or 7 chips. If not, the 
savings in queue gates will compensate for the 
difference. Again the crossbar node is less ex- 
pensive than a composite node where both have the 
same broadcast capability. 


V. CONCLUSIONS 

At a local level, the crossbar node is always 
faster at passing four messages that arrive simul- 
taneously than the composite node. If the connec- 
tion requests do not conflict in the composite, 
the crossbar is twice as fast. When the connec~ 
tion requests of the messages form a permutation 
which the composite node cannot pass without con- 
flict, it takes 3 times longer for all messages to 
exit the composite node. Assuming each message 
chooses each output with equal probability, on the 
average it takes approximately 53% more time for 
all messages to pass through the composite node 
than through the crossbar node. 

The ORA Logic is the only logic requiring more 
hardware in a crossbar node than in a composite 
node. Otherwise, a crossbar node requires half as 
much queue control and IRA Logic, and half as many 
queues. The multiplexer Logic is less than or 
comparable to that needed by the composite node. 
The net result is that when packet switching is 
implemented, the 4x4 crossbar node requires less 
hardware and significantly out-performs a compo- 
site node. 

If circuit switching is implemented, no queues 
or their associated control logic are required. 
In this case, the crossbar node does contain more 
hardware. However, it offers a significant im- 
provement in connectivity/permuting ability. If 
the switching nodes are implemented as VLSI chips, 
Since both nodes require the same number of pins, 
the gate/pin ratio is improved with a crossbar im- 


plementation. Only in the case where circuit 
Switching is implemented in discrete Logic is 
further consideration required. Without.a broad- 


cast capability (which is less important in a cir- 
cuit switching environment), there is only a small 
difference in the chip count. 


In summary, the implementation of cube-type 
networks using 2x2 and 4x4 crossbars were com- 
pared. It was shown that for packet switching the 


4x4 crossbar is a more cost-effective approach. 
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Abstract 


This paper presents a systematic way of 
finding whether Multistage Interconnection (MI) 
Networks with log,N stages and implemented with 
2x2 Switching Elefients (SEs) are non-equivalent 
or not. The basic strategy is to employ a graph 
theoretic approach to model the MI networks in 
the form of the directed graph and use non-iso- 
morphic properties of the loops or cycles in cor- 
responding undirected graphs. The distance con- 
cept of a binary tree is utilized to ensure the 
full connectivity requirements of the MI net- 
works. 


I. Introduction 


Various MI networks described in the litera- 
ture [7] are topologically equivalent to each 
other [1-3,7] as the SEs of the networks are con- 
nected in such a way that they possess a "buddy" 
property [4]. This means that the outputs of two 
SEs at stage i are connected as inputs to only 
two SEs at the (itl)th stage and thereby a uni- 
formity is provided in the network. Patel [5] 
has stated that probably there are only two non- 
equivalent 8x8 networks and named these as delta 
networks. For finding non-equivalent networks, 
we considered several alternatives and the use of 
graph theory looked to be more promising. 


II. Graph Model of the MI Network 

We are concerned with N-input N-output net- 
work (N=2") utilizing 2x2 SE and having n stages 
connected in such a way that it provides full 
connectivity, i.e., any input terminals can be 
connected to any one of the output terminals. 
Fig. il-a shows the SE and its two possible 
states and Fig. 1-b shows the proposed graph 
model of Fig. l-a with the directions indicating 
flow of data. Note that the control line is to- 
tally eliminated in graph representation of the 
SE. Note that contrary to the topology describ- 
ing rules [7], the SE with link patterns as shown 
in Fig. l-c is modeled and treated just like the 
one shown in Fig. 1-b. The MI network can now be 
modeled as a directed graph by assigning nodes to 
each SE and input/output terminals, and providing 
connection links between nodes of various stages. 
This leads to the graph model of the Omega Net- 
work of Fig. 2-a as shown in Fig. 2-b. 


It is also worth noting that the part of the 
graph (or subgraph) representing the connections 
between two stages are bipartite [6] in nature 
and the input/output terminals of Fig. 2-b can be 
excluded without losing any topology information 
thereby making the analysis much simpler. The 
reduced graph is shown in Fig. 2-c. Separating 
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each N-node bipartite subgraphs leads to exactly 
(n-1) such subgraphs and such subgraphs are shown 
in Fig. 2-d. 

III. Single Level Partitioning of Stages 

With the proposed graph model, the non-equi- 
valence between possible MI networks is trans- 
formed to non-isomorphism of bipartite subgraphs. 
Theorem 1: There exist at least one closed loop 
(cycle) in a bipartite subgraph of the MI net- 
work. 
Proof. From Section II each bipartite subgraph 
consists of N/2 nodes in each set with N edges, 
each node of one set is connected to two nodes of 
another set, and there has to be at least one un- 
directed closed path. Q.E.D. 
Corollary 1: All the paths 
graph form the closed loops. 
Proof. This is obvious from the characteristics 
of bipartite graph with each node of degree 2 and 
total number of edges is the same as number of 
nodes. Q.E.D. 
Corollary 2: The number of paths, j, constitut- 
ing a loop is determined by the relation j=4+2°i, 
where i = 0,1,2,...,(N/2-4), (N/2-2) and i gives 
the integer value satisfying corollary 1 and the 
loop will be defined as jP-L. 
Proof. Follows from Corollary 1. Q.E.D. 

For instance, the bipartite subgraph of 23 


in the bipartite 


inputs MI network has only two possible kinds of 
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loop structure and these 4P-L and 8P-L are shown 
in Fig. 3-a. For the case of 2* inputs/outputs 
network it is possible to make i equal to 0,1,2, 
3,4,6 and these loops correspond to 4P-L, 6P-L, 


8P-L, 10P-L, 12P-L, and 16 P-L as shown in Fig. 
32D, 
Theorem 2: Among all possible bipartite sub- 


graphs of the MI networks, the subgraphs are not 
isomorphic to each other iff the number and type 
of loops are not exactly the same. 

Proof. Obviously, the loops consisting of dif- 
ferent number of paths are non-isomorphic due to 
their own unique structure. If the number and 
type of loops are exactly the same in two bipar- 
tite subgraphs, then they can be shown to be iso- 
morphic by changing their positions to correspond 
to each other. Q.E.D. 


It may be noted that all the MI networks 
having same number of and type of loops may not 
be isomorphic and will be considered in Theorem 6 
of section V. 

Example 1: Two different loops for the bipartite 
subgraphs of 8 x 8 network result into two dif- 
ferent topologies of Fig. 3-a (Theorem 2) 
[2(4P-L) means two 4P-L]. 

Example 2: For the case of 16 x 16 network, six 
different kinds of loops are possible and we can 
obtain 7 different topologies as illustrated in 


Fig. 3-b. 
IV. Isomorphism of MI Network 
Theorem 3: Reordering the positions of the nodes 


in the same stage of the MI network does not af- 


fect the loop structure within each bipartite 
subgraph. 
Proof. It is obvious from the graph theory [6] 


that reordering of a graph will provide another 
graph which will be isomorphic to the first 


graph. Hence, loop structure remains the same. 
Q.E.D. 
Theorem 4: All possible combinations of the 


graphs resulting from grouping and joining of two 
or more non-isomorphic bipartite subgraphs are 
also non-isomorphic. 

Proof. A network with one specific topology 
created by combination of some bipartite sub- 


graphs provides its own unique combined loop 
structure and according to Theorem 3, a graph 
guarantees the uniqueness of the loops. Q.E.D. 


V. Loop Analysis 

Let us modify the bipartite graphs of Fig. 
2-d in the form of planner graphs as shown in 
Fig. 4-a while its loop structure is maintained. 
The next step is to connect the nodes of two 
adjacent levels. This is done on the assumption 
that output nodes of the loops at the first level 
are at the input nodes at the second level. 
Hence, if we assume 4 and 4‘ etc. are not separ- 
ate but the same nodes then we obtain the graph 
of Fig. 2-c back. These overlappings of nodes 
are indicated by a simple line connecting nodes 4 
and 4'. Following the same procedure, it can be 
shown that 8x8 Baseline Network [7] will have the 
same loop diagrams as shown in Fig. 4-b and only 
difference will be numbering of nodes. 


This synthesis procedure could be used to 
define the composite graph of all possible MI 
networks which are not isomorphic to each other. 
This could be done starting from the non-isomor- 
phic loop structures and interpolate the un- 
numbered nodes of various stages with restric- 
tions to satisfy full connectivity requirements. 


Connecting the Squared Loops 


A. Draw the loop structures and mark alternate 
nodes for inputs by '*' and the rest for the 
outputs by '‘'x'. The output nodes of the 
first level are to be overlapped or merged 
with the input nodes of the next level and 
this process is to be performed for all 
levels in such a way that full connectivity 
is provided by the resulting network (Fig. 
4-b). The basic requirement is to form a 
binary tree from each input node [3] and 
hence in Fig. 4-b the output nodes of level 
1 are connected to input nodes of level 2 in 
such a way that the minimum distance between 
input nodes in loop of level 2 to be at 
least 2 times the distance between the cor- 
responding output nodes of level 1. if 
we have many levels, then the minimum dis- 
tance to get the full connectivity has to be 
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ita is where 2 is the level number of 
the loops. The distance between different 
loops of the same level is regarded as in- 
finite. 
Theorem 5: The above design procedure provides 
full connectivity in the resultant MI network. 
Proof. The proof is obvious from the tree forma- 
tion procedure [3]. Q.E.D. 
B. If each level consists of more than 2 loops, 
then different topology of connecting 
schemes for a given set of loops might 
exist. For instance, from Section IV, 3 
different topologies are possbile for 8x8 
network. But according to Theorem 5, we see 
that only two topologies of [2(4P-L); 
2(4P-L)] and [2(4P-L);8P-L] provide full 
connectivity (Figs. 4-b and 4-c) and the 
third [8P-L;8P-L] of Fig. 4-e fails to sat- 
isfy the distance requirements. 
Theorem 6: For a given specific structure of the 
loops, it is possible that different alternative 
connections of loops might provide more than one 


non-isomorphic graph and hence non-equivalent 
networks. 
Proof. The proof of Theorem 6 will be given 


using an example 
possible ways of 


to show that there are many 
connecting the loops at two 
adjacent levels. Take an example of Fig. 5-a 
which consists of four 4P-L in every level. To 
see the ways of connecting scheme explicitly, we 
represent each (4P-L) loop by a Macro node as 
shown in Fig. 5-a. This modified structure has 
already been covered in Fig. 4 and the two non- 
isomorphic schemes are shown in Fig. 5-b, leading 
to two different connections as in Fig. 5-c. The 
actual graphs resulting from the loop diagrams of 
Fig. 5-c showing two different topologies are 
illustrated in Fig. 5-d. Q.E.D. 


VI. Construction Algorithm of the MI Networks 

a. For each level, select the loops with 
each loop consisting of j nodes and 
edges where j = 4 + 2*:i for i = 0,1, 
2,..-,(N/2-4), (N/2-2). 


In each loop, assign alternate nodes as 
for inputs and outputs. 


C; Connect adjacent levels of loop diagram 
such that they satisfy Theorem 5 and 
assign numbers to all the nodes. 


Obtain the graph model of the MI net- 
work by merging each adjacent levels of 
the loop diagram. 


Convert it to the normal diagram show- 
ing 2x2 SEs. Choose any model of the 
SE from Figs. l-a or 1-c to be used. 


VII. Concluding Remarks 

In this paper, we presented a systematic 
methodology for determining non-equivalence of MI 
networks and designing such networks. It is our 
firm belief that the novel approach introduced in 


this paper provides a better perspective of the 


non-equivalent MI networks. 
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Fig. 2-b Graph model of Omega network. Fig. 2-c Simplified 
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Fig.3-a Two kinds of bipartite 
graph of 8 x 8 MI network. 
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Fig. 3-b Seven kinds of bipartite graph of 16 x 16 ME network. 
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marked by 'e' and 'x' as ; 
L2 L6 L10 
input and output nodes 
7 3 ll 7° respectively. 2(4P-L) 2(4P-L) 
a ; i ois from 4(4P-L) 4(4P-L) 4(4P-L) 
° a 
LO L4 L8 
O e) O 
Ll L5 L9 
.@) 0 O 
L2 L6 L10 
.@) .@) 
L3 L7 LI1 
.@) oO fo) 


Macro node structure 


2(4P-L) 8P-L 


Fig. 4~b Full connectivity is satisfied in this 
connection. The distance between input Fig. 5-a Loop structure of Fig. 5-b Macro loop diagram 
4 and 5, for example, is greater than 3. 4(4P-L) ;4(4P-L) ;4(4P-L). 


2(4P-L) 8P-L 
Fig. 4-c loop structure of 2(4P-L);8P-L. 


Fig. 4-d One example of graph which 
has loop structure of Fig. 4-c. 


Fig. 5-d Graph models resulted from 
Fig 5-c Two different ways of Fig. 5-c. They are non-equi- 
connecting loops. valent. 


8P-L ‘ 8P-L 
Fig. 4-e Loop structure which does not meet full connectivity. 
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INTERCONNECTION TOPOLOGIES FOR FAULT-TOLERANT 
PARALLEL AND DISTRIBUTED ARCHITECTURES 
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ABSTRACT 


Communication architectures are presented, 
suited for the design of fault-tolerant parallel 
and distributed processors. Attractive features 
of these communication topologies include small 
internode distances, low interconnection complex- 
ity, ease of message routing, modularity, fault- 
tolerance, and reconfigurability. 


1, INTRODUCTION 


A key component of parallel/distributed pro- 
cessor architecture is the interconnection struc- 
ture used for communication between processors, 
memories, etc. This paper presents a new class of 
interconnection topologies which combine some of 
the interesting features of the existing struc- 

- tures, like the loop [1], the binary tree [2], and 
regular networks [3]. The following elaborates on 
some of the important features of the i 
interconnection structures; 

(i) suitability for parallel processor 
design: interconnection patterns ~- such as bin- 
ary tree [2], perfect shuffle [4], inverse perfect 
shuffle [4], uniform shift [4] - are embedded in 
our structure. Thus, the topology presented here 
is well-suited for various parallel algorithms 
which are used for sorting, FFT, matrix manipula- 
tions [4]. Also, other algorithms which are based 
on principles of ''divide and conquer! or recursion 
can be well-executed because of the imbedded bin- 
ary tree structures, 

Furthermore, our interconnection structures 
are amenable to parallel processor design in VLSI 
because of the simplicity of the layout and the 
low degree of interconnection. 

(ii) suitability for distributed processor 
design: the proposed topologies have sma} | 
internode distances, ease of message routing and 
modularity. Thus, they satisfy some of the key 
requirements to be useful] as communication archi- 
tectures for distributed systems [5]. 

(iii) fault-tolerance: fault-tolerance is 
becoming an increasingly important attribute of 
computer systems. The topologies given here are 
fault-tolerant against both node and link failures. 
Also these topologies possess relatively short re- 
dundant paths. Consequently, fault recovery and 
reconfiguration can easily be achieved without 
significant degradation. 

This paper is organized into the following 
sections. Section 11 develops certain notations 
and definitions. Section Ill presents the pro- 
posed topologies. Also included here are certain 
related results and routing algorithms. Next in 
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Section IV fault-tolerance, reconfigurability and 
modularity of the topologies are studied. Follow- 
ing this, techniques that modify the topologies 
are presented which achieve greater fault~tolerance. 


l!. NOTATIONS AND DEFINITIONS 


An interconnection topology will be repre- 
sented as an undirected graph, G, with n nodes, 
(Ov 2isae gy iT) 

An redGes e(i,j), in G represents a bidirec- 
tional data link between i and j. 

A pair of nodes, i and j, are neighbors of 
each other if there is an edge, e(i,j), in G. 


The degree, d(i), of node i is the number of 
neighbors of i. 
Let d = min {d(i)|0<i<(n-1)} represent the 


minimum of the degrees of nodes in G. 


Let D = max {d( (i) | Osi<( n-1)}, represent the 


maximum of the degrees of nodes in G. 


Let k(i J) represent the distance between 
nodes i and j. The distance, k(i,j is equal to 
the number of edges in the shortest path between 
i and j. If there are no paths between i and j, 
then k(i,j) will be assumed to be ~. 

Let k = max {k(i (i,j) [O<t,j<(n-1)} represent 
the diameter and be the maximum internode distance 
in G. 

Let k, represent the maximum of the diameters 
of all graphs that can be obtained from G by re- 
moving some e nodes from G, This k, will be re- 
ferred to as the e-diameter of G. 

Thus, ke can be viewed as a measure of the 
worst case propagation delay when e (faulty) nodes 
are removed from G. 

Let c denote the node connectivity of G. 
Thus, c represents the minimum number of nodes 
which when removed from G, will disconnect G. 
Obviously, c<d and k, = ©. 

Thus, (c-1) will be referred to as the fault- 


tolerance of G, 


Consider G, shown below. Here, d=2, D=3, 


k=2, k,=3 and c=2. 


d(5)=3(( 


o(4,5) | e(2,3) 


e(3,4) 


lll. -ENTERCONNECTION TOPOLOGIES 


This section presents two different classes 
of system topologies. Although both classes have 
certain similarities, they are significantly dif- 
ferent in structure, and each class will be shown 
to possess certain distinct advantages over the 
Other. 

System Topology | (ST-1) 

A pair of nodes, i and j, are neighbors if 

they satisfy any of the following relationships: 


(a) j = i+] modn 
(b) j = i-1l mod n 
(c) j = 2i modn 
(d) i = 2j modn 


System Topology II (ST-11) 
Nodes i and j are neighbors if they satisfy 


any of the following relationships: 


a) i = 2j  modn 
(b) j = 2i modn 
(c) i = 2j+1 mod n 
(d) j = 2i+] mod n 


Fig. 1 and Fig. 2 illustrate a 12-node ST-| anda 
ST-I1l, respectively. 

The ST-I represents a superimposition onto a 
basic loop structure. However, when compared to 
that basic loop structure, the ST-I will be seen 
to have significantly smaller internode distances; 
also, it is not vulnerable to single point fail- 
ures. 

The second topology, ST-!1!, (a generaliza- 
tion of Pradhan-Reddy [6]), has better internode 
distances than ST-I in the absence of any faults. 
On the other hand with a single node failure ST-| 
has the potential to achieve better internode dis- 
tances than ST-II. Furthermore, ST-! enjoys cer- 
tain implementational and functional advantages 
over the ST-I| because of the following: 

In the ST-I, unlike in the ST-I1, two nodes 
that have consecutive logical addresses are neigh- 
bors; and thus, are adjacent to each other. Con- 
sequently, at least half of the links in the ST-] 
can be short and laid out ona single plane. So, 
data exchanges between i and i+] (which consti- 
tute a major portion of the overall communication) 
can be done rapidly. 

The following notations are used throughout 
this paper: : 

Notations: (i) [x] denotes the smallest integer 
that is greater than or equal to x. 

(ii) log x denotes [logox]. 

(iii) all arithmetic operations used 
in respresenting node numbers are modulo-n opera- 
tions. 

The following theorems provide certain char- 
acterizations of ST-I and ST-I]. 

Theorem 1: In ST-1, if n=odd and n>5, then: 
i) node 0 has degree 2 a 
(ii) nodes 1, 2, (n-1) and (n-2) have 


degree 3 O 
(iii) all other nodes have degree 4 
Theorem 2: In ST-1, if n=even and n>8, then: 
(i) the nodes 1 and (n-1) have degree 2 


(ii) all other odd nodes and the node 0 
have degree 3. 

(iii) nodes 2, (n-2)(and n/3, 2n/3 if 3 
divides n) have degree 4 

(iv) all other even nodes have degree 5 
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Fig. |. ST-I for n=12 


Fig. 2. ST-I1l for n=12 


Theorem 3: In ST-I1, for n28: 
(i) the nodes 0 and (n-1) have degree 2 
(ii) the nodes of degree 3 
(n/3), (2n/3), (n-3)/3, (2n-3)/3, 
are ]- if 3 divides n 
(n-1)/3, (Zn-2)/3, if 3 divides (n-1) 
(n-2)/3, (2n-1)/3, if 3 divides (n-2) 
(iii) all other nodes have degree 4 O 
Theorem 4: In ST-I for all n,ne>5, the total 


ene. 


(2n-4) if 6 divides n 

(2n-3) otherwise 0 

for all n, n>8, the total 

aed if 3 divides n 
(2n-3) otherwise a) 

and ST-I1 require a small 

and have low degree of inter- 


In ST-11 
number of data links 


Thus, both ST-1L 
number of data links 
connection. 

The following develops message routing stra- 
tegies and upperbounds on path lengths in ST-! and 
ST-I I. 


Theorem 5: 


Message routing 

The following develops a distributed routing 
algorithm for ST-I. Analogous procedures for 
ST-I1 can easily be formulated. 

First a technique is developed to construct 
a path from a given source node, i to a destina- 
tion node j. With this technique, a message rout- 
ing algorithm is developed. This algorithm uses 
certain tag (control) bits. These bits are gener- 
ated at the source and carried by the message. 
Each intermediate node use these tag bits to de- 
termine the next node in the path. Minor modi- 
fications of the tag bits are carried out by the 
intermediate nodes. As will be seen the routing 
procedure is very simple and requires only a small 
number of tag bits which are easily generated. 


Thus, the communication overhead can be fairly low. 


Definition: 


Let m, m>0, be the least integer for 
which: 


(j - i gmt) mod n Pay aes 
Obviously there always exists such an m where 
m< log n. soca 
Let p= (j - i 2) mod n. 
_ m- | m-2 
Let p = Pan? + Pao +...+ p,2 + Po 


in radix-2 (binary). 
The following procedure constructs a path 
from i to j in ST-1. 


Path Construction Procedure, Pl, for ST-I: 

The following procedure is described by using 
a pointer, x, which travels through the path from 
i to j. 

The value of x during each iteration repre- 
sents successive nodes in the path. 


St 
$2: 
Let h=h-1, if pp =0. Then let the new value 


Let x=i and h=m 
Repeat following while h>l: 
of x=2x. On the other hand, if pial; then first 
let x=(xtl), and next, x=2x. 

$3: 


stop else stop. 


Finally, if Pot! then let x=(x+1), and 


The following two observations may be made: 
(i) the final value of x at the end of the 
procedure is (j2m-Ty p) which is equal to j, the 


‘destination node. 
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(ii) the path length is, at most, (2m-1). 
Routing Strategy: 


Each message may be formatted, as shown in 
Fig. 3. The message has two tag bit fields, a 
and 8B, which consist of m and log m bits. 


| Destination pice " 
; Address 4g 


one er ome 


ae ett et etl 
m logm m 
Fig. 3. Message Format 
The source node, i, computes p = (j - i gm) 
mod n. Then a and 6 are initialized to p and 


(m-1), respectively. 
O<q<m-1 and 8 = (m-1). 

When a message arrives at node y, the follow- 
ing steps are carried out. The source node also 
uses the following procedure to route the message 
to the next node. 


Step 1: if the destination address is y,. then 
the message is removed; otherwise Step 2 is per- 
formed. 


Step 2: if ag = 0, then let 8 = 8 - 1, and the 
message is forwarded to node 2y next. 


On the other hand, if ag = 1, then first let 
ag = 0 (change ag to 0) and forward the message 
to the node, (y+). (Thus, if ag = 1 then the 
message is effectively forwarded to 2(y+1) via 
(y+1)). 

Thus, the above procedure is simple and 
requires only(m + log m)bits - a small number 
compared to the number of nodes n since m = log n. 
However, it should be noted that above procedure 
does not always result in routing the message 
through the shortest path. : 
Example 1: Let i = 3, j = 9. Consider routing 
from 3 to 9 in ST-!I in Fig. 1 where n=12. 

The least m that satisfies, 

(j - i 2™1) mod n < 2-1, is m= 2. 
Thus, p = 3 and hence, a = p =(11) and 
m- 1=1 in binary. (Initially a] = % = 1.) 
Hence, it can be computed that the message 
will go to node 9 via nodes 4 and 8. The tag bits 
at each node are shown in the following Table. 


Thus, at the start,a =p 
q q 


B 


node at arrival at departure 
OL B a, B 
3 11 ] 01 1 
\ ol 01 0 
8 01 0 00 
9 00 0 


The following theorems are direct conse- 
quences for the routing procedures for ST-1 and 
ST-T1. 


Theorem 6: For. ST-i, the diameter, k, is bounded 
from above by 


k<2(logn) -1. G 


Theorem 7: For ST-I1, the diameter, k, is bounded 
from above by 
k < log n. 0 
IV. FAULT-TOLERANCE AND RECONFIGURABILITY 


This section formulates results regarding the 
fault-tolerant capacity of the ST-! and ST-I1. 
Techniques are also developed that reconfigure the 
paths in the event of a fault. The proposed re- 
configuration will be shown to cause only a mini- 
mal degradation of the path lengths. 

Although the fault-tolerance of the ST-I jis 
self-evident, (because of its embedded loop struc- 
ture) the effect of a fault on path lengths is not 
obvious. In the case of the ST-II, neither the 
fault-tolerance nor the effect of fault is readily 
apparent. The results of this section will pro- 
vide some insight into these aspects. 

The following describes a routing strategy 
for ST-1 that can be used to bypass any faulty 
node with only a resulting minimal increase in the 
path length. 

Routing procedure for ST-| with faulty node, t 

From the routing procedure developed earlier 
for the ST-1I, the following observations may be 
made: 


(i) Any intermediate node, t, in a message path 
may receive the message from only two of its neiglr 
bors: the node (t-1) or (t/2). 


(ii) There are only two possible neighbors of t to 
which the message may be forwarded from t; these 
are 2t and (2t+1). | 

In the following, we show a technique for 
sending the message from {(t-1) or (t/2)}to {(t+1) 
and 2t}without going through t. Consider the 
following paths where 3<t<n-1: 


t-t—>2t-2 —>» 2t-]—»2t —>2t+]1—>2t+2 —> t+] 


t/2—>t/ 2+] —>tt+2 —> tt l—»2t+2 —>2t+] —> 2t 


It may be noted that the above paths are of 
length 6 and do not pass through t. Thus, if the 
message is routed through one of these alternate 
paths, then there will be a net increase of only 
4h in the path length. 

For t = 0,1,2, (n-1), (n-2), a set of paths 
of length, at most, 6 can also be constructed that 
allow for sending the message from {(t-1), (t/2)} 
to {(t+]l), 2t} without going through t. 

For example, let t = 2. In this case, (t-1)= 
(c/2)=1, (tt+l)=3 and 2t=4. Consider the fol- 
lowing path for n even: 


1-7 0 —> n/2 ~» n/2+1— n/2+2 7AM~4 > 3 


Thus, the node, 2, can be bypassed with a longer 
path. 

The following is an immediate consequence of 
the above observations: 


Theorem 8: The I-diameter, ky» of ST-! 


from above by: k, < 2(log n) +3 


is bounded 
0 
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Theorem 9: 


Thus, Ky is comparable to k. 

Now we will develop a routing procedure for 
ST-Il. This procedure is significantly different 
from that given for ST-I, and is based as the 
following Lemmas: 


Lemma 1: For any node, i, in ST-II, there is a 
path between 0 and i of length, at most, log n, 
where each intermediate node, y, in the path is 
Strictly less than i, y<i. Oo 


Lemma 2: For any node, i, in ST-I1, there is a 
path between i and (n-1) of length, at most, 

log n, where each intermediate node, y, in the 
path is strictly greater than i, y>i. Oo 


Lemma 3: There are two node disjoint paths be- 
tween O and (n-1) in ST-II, of length, at most, 
log n. O 


A message routing algorithm that can be used 
in the event of a fault can be formulated by using 
the following path construction procedure Po. 


(i) Let O-x (x+0) represent a path, 
(x to 0), that satisfy Lemma |. 

(ii) Let (n-1)>x (x+(n-1)) represent 
(n-1) to x (x to (n-1)), that satisfy 


Notations: 
from 0 to x 


a path from 
Lemma 2. y y 
(iii) Let O0+(n-1) ((n-1)30) represent 
a path of length, at most, log n from 0 to (n-1) 
((n-1) to 0), that do not pass through the node, 
y. Lemma 3 guarantees the existence of such paths. 


Path Construction Procedure, P2, for ST-I! with 


Faulty Node t 


i:—Source j:—Destination 
Case t=0 
““p(t): i>(n-1)>j 
Case II t=(n-1) 
p(t): +09] 
Case !11 t#0 and t#(n-1) 
a t>i and t<j 
p(t): i05(n-1)>j 
(b) t>i and t>j 
p(t): i>0>j 
(c) t<i and t<j 
p(t) i+(n-1)>j 
(d) t<i and t>j 
p(t): i+(n-1)>0 j 


The paths satisfying Lemmas 1, 2 and 3 can 
be constructed by using procedures that are simil- 
ar to those given in Pl. Therefore, a routing 
strategy like the one formulated for ST-1, using 
tag bits, can also be developed here for ST-IlI. 
lt may be noted that for ST-I1, in the event of a 
fault, there is a potential for ''bottleneck'' be- 
cause each message has to be forwarded through 0 
or (n-1). The following Theorem is immediate from 
the above path construction procedure. 


The I-diameter, ky of ST-I! is bounded 
from above by 


k, < 3(log n) -1 a 


The following discusses possible modifica- 
tions of ST-! and ST-I1 for greater fault-toler- 
ance. It will be seen that the addition of a 
single link can make these 2-fault-tolerant; i.e., 
any two (faulty) nodes can be removed. 

Since d=2, for ST-I, I! and d>c, in order 
to increase the connectivity, c, one has 
to increase d. There are only two nodes of de-. 
gree 2; therefore, one can increase d by simply 
adding. a link that connects these two nodes. In 
the following, we. show that the fault-tolerance 
is also increased with the addition of this extra 
link. 


Modified System Topology | (MST-1) 
Add link 4 e(0,2) to ST-| if n=odd 
a ~ e(1,n-1) to ST-1i if n=even 
Modified System Topology I|1(MST-11) 
Add link e001) to ST-Il. 
By adding the link e(1,11) to Fig. 1 and 


e(0,11) to Fig. 2 one can obtain a 12 node MST-! 
and MST-II respectively. 


Theorem 10: The connectivity, c, of MST-I is 3 


Theorem 11: The connectivity, c, of MST-II is 3 
and ky < 2m, when n = 2™, 0 
MODULARITY 


Modularity is an attribute of interconnection 
Structures that refers to the ease with which in- 
cremental changes can be made. 

The following illustrates a general technique 
for extending ST-I and ST-I1, without significant- 
ly altering the existing system. 

Let G be an existing ST with n nodes. If it 
is required that an additional n' node may be ad- 
ded, the following scheme may be used: 

First, a new BST, G', is constructed using n' 
nodes. The nodes, 0 and (n-1) in G, are connected 
to 0' and (n'-1) in G, respectively. 

This will not increase the D of either G or 
G' since nodes 0 and (n-1) have a degree of, at 
most, 3, in both ST-I! and ST-!1. 

The diameter (l-diameter) of the composite 
system will be the sum of the diameters (I-diamet- 
ers) of G and G'. | 

Subsequently, if it is required that a furth- 
er set of n'' nodes is to be added, the following 
scheme may be used: 

Construct a new ST G'', with n'' nodes. Then 
this is inserted between nodes 0 and 0! by con- 
necting O and (n''-1), O' and O' shown jn 
Fig. 4. 

The diameters will now again be the sum of 
the diameters of G, G' and G''. This latter pro- 
cess obviously can be repeated as many times as 
it may be required. 


REMARKS : 

1. (a) Given any node, y, one can reach nodes 2y 
and (2y+1) in, at most, two steps. Thus, data 
exchanges that correspond to a binary tree can be 
implemented easily for any node as a loot node. 


(b) One can reach both (y/2) and (2y) from any 
node, y, in one step. Therefore, data exchange 
patterns that correspond to perfect shuffle and 
inverse perfect shuffle [4] can be implemented 


easily. 


(c) Given enough buffer capacity at each 
node data exchange patterns that correspond to 
any arbitrary permutation can be implemented in 
order log n (O(log n)) steps. 


Thus, from (a), (b) and (c), one may infer 
that the topologies are suited for fast implemen- 
tation of various parallel processing algorithms. 


2. Some of the key requirements for a topology 
to be suited for distributed processor intercon- 
nection is that it should possess a low degree of 
interconnection complexity, small internode dis-~- 
tances, ease of routing and modularity. All of 
these requirements are satisfied by our topolog- 
ies. 


3. One of the other attractive features of the 
proposed topologies is that they are not only 
fault-tolerant but also easily reconfigurable. 

h, Problems currently under investigation in- 
clude formulation of good routing strategy that 
will allow for bypassing two faulty nodes without 
excessive degradation. (It would be of graph 
theoretical significance to derive a bound on 

the minimum number of edges required to achieve 

a specified ko for a given n and D.) 


Fig. 4 Modular Extension G' 
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FAULT DIAGNOSIS AND DESIGN OF FAULT-TOLERANT CONCENTRATORS 


S. Sowrirajan and S. M. Reddy 
Division of Electrical and Computer Engineering 
The University of Iowa 


Iowa City, Iowa 


Summary 


Switching networks may be classified as con- 
nectors, concentrators, partitioners and expand- 
ers [1]. A concentrator is a contact switching 
network that provides a number of potential users 
(connected to its inputs) with access to a smaller 
number of equivalent resources (connected to its 
outputs) 24. In a concentrator the outputs to 
which the inputs are to be connected cannot be 
specified a priori. A concentrator can be repre- 
sented by the triplet (I,0.r), where I is the set 
of inputs, 0 is the set of outputs and r is the 
crosspoint placement relation between I and 0. 
The capacity of a concentrator is said to be C if 
any k inputs, k < C, can be connected to some k 
outputs simultaneously. 


Definition 1: (n,m) = (1) = CoNrie A binomial 


(n,m) concentrator is a concentrator having (n,m) 
inputs, n outputs such that each input is connect- 
ed to a unique choice of m out of n outputs by 
crosspoints. The capacity of binomial (n,m) con- 
centrator is shown to be min{m+2,n} [3]. A 
binomial (4,2) concentrator is shown in Fig. 1. 


Fault-Model of the Crosspoints: 
Typically, in a crosspoint network a cross- 


point is connected between an input line and output 
line as shown in Fig. 2. The crosspoint has two 
states under normal operation namely open and 
closed [4,5]. 


Definition 2: A crosspoint is said to be faulty 
if it is permanently in one of the two states 
namely, close and open. A crosspoint is said to 
be "stuck-at-close" (s-a-c) if it is permanently 
in the closed state and it is said to be "stuck- 
at-open" (s-a-o) if it is permanently in the open 
state. This is shown in Fig. 3. 


Single Fault-Diagnosis in a Binomial (n,m) 
Concentrator: 

~~ A binomial (n,m) concentrator has (n,m)-m 
crosspoints and n outputs. Hence at most n cross- 
points can be tested for a single s-a-o fault at 
any one instant - one crosspoint per output. 

Hence the optimal number of test sets to diagnose 
a single s-a-o fault is (nsm) ™ = (n-1,m-1). The 
Theorem 1 says that such a test set can be ob- 
tained for a binomial (n,m) concentrator. 


Definition 3: Let U = (Ay: icI) be a family of 
subsets of a set E. Suppose that it is possible 
to select one element x; from each set Aj in such 
a way that the elements xj, icI, so selected are 
distinct. Then the set {xq: x; # xj ¥ i, jeI and 
i # j} of these elements is cal led a transversal 
of U. 
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Definition 4: Let Bs = {j: j is an input such 
that there exists a crosspoint (j,i) and i is an 
output}; 1 < i < n. Since there are (n-1,m-1) 
crosspoints on each output line |B;| = (n-1,m-1), 
1 < i <n, where |X| is the cardinality of set X. 


Definition 5: A transversal T; is said to be 


ordered if and only if Tj = (x5 poXa ger o%a gee 


Xn 5) where Xa eB, 1 < j < (n-1,m-1) and 
< J <n, 

Definition 6: Two ordered transversals T; and Tj 
are disjoint if and only if x, 7 FX, - lyk) 
T<eke«en. ; ; 


Theoren 1: (n-1,m-1) mutually disjoint ordered 


transversals Tyee eT (ny m1) on the family 


(Bs: ie {1,...,n}) of subsets of inputs exist and 
can be calculated. For a proof see [7]. 


Procedure to Diagnose Single Fault in a Binomial 
(n,m) Concentrator: 


Step 1: Set all the crosspoints to state open and 
apply a 1 to all the inputs. If there is an out- 
put such that a 1 is received on this output then 
there is a S-a-c faulty crosspoint on this output 
line and do Step 2; else go to Step 3. 


Step 2: Test each crosspoint on this output line; 
stop. 
Step 3: At instant j, 1 < j < (n-1,m-1) test the 


crosspoints e 15 for s-a-o fault. If a s-a-o 
fault is diagnosed at instant j then stop; else if 
j= (n-1 m1) then there is no faulty crosspoint; 
else j<«jtl and go to Step 3. 


Procedure to Diagnose Multiple Faults ina Bi- 
nomial (n,m) Concentrator: 

Step 3 in the above procedure is to be modi- 
fied to take into account s-a-c faulty crosspoints 
on output lines x1,X2,...,X,. If the crosspoint 
(y,x) is s-a-c and if ye T! for some j, 1 <j < 
(n-1,m-1), then while testing crosspoints eT}, the 
output x will receive a 1 and hence the crosspoint 
(yz.x) such that yzeTs cannot be tested for s-a-o 
fault. This fact is taken into consideration to 
obtain the test sets. For more detail see [7]. 


Capacity of the Binomial (n,m) Concentrator in the 
Presence of a Single Fault: 


Theoren 2: Let the crosspoint (a,b) be faulty. 
Then the capacity of the binomial (n,m) concen- 
trator in the presence of a faulty cnosspoint 
(a,b), denoted by Ca b)? is equal to (C-1) under 


the assumption that m+2 < n. For a proof see [7]. 


Capacity of the Binomial (n,m) Concentrator in 


the Presence of k Faults: 


Theorem 3: The capacity, Cy, of the binomial 
(n,m) concentrator with k<m faults = (C-k) where 
C is the capacity of the fault free binomial (n,m) 
concentrator under the assumption that mt2 < n. 
For a proof see [7]. 


It can be shown that there exists a fault 
pattern of k crosspoints such that Cy, = C-k. 


The above two theorems give us a methodology 
to design a k fault tolerant, m+k < n, binomial 
concentrator. A binomial (n,mt+k) concentrator 
has a capacity > (C-k) if k or less faults occur 
where C is the capacity of binomial (n,m) concen- 
trator. The number of inputs is however (n,mtk). 
The number of inputs can be increased to m+(n,m+k) 
without affecting the capacity of the concentrator 
with k faults. | 


Definition 7: An (m,0) concentrator is a concen- 
trator with m inputs x1,X2,...5X_ and one output 
y, such that there exist crosspoints (xGoy7); 
1l<is<m. 


Definition 8 [6]: An (m,0) x (n,m) concentrator 
has m-(n,m) inputs and n outputs such that each 
crosspoint in the (m,0) concentrator is replaced 
by the binomial (n,m) concentrator. 


Theorem 4: An (m,0) x (njmtk), mtk < n, concen- 
trator in the presence of k faults has a capacity 
> C, where C is the capacity of the binomial (n,m) 
concentrator. For a proof see [7]. 
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Abstract 


The suffix problem has appeared in solutions 
of recurrence systems for parallel and pipelined 
machines and more recently in the design of gate 
and silicon compilers. In this paper we present an 
algorithm that generates parallel suffix solutions 
with minimum cost and size for a given length, time 
delay, availability of initial values and fanout. 
This algorithm generates a minimal solution for any 
length n and depth range from [log,n] to ne 


i. Introduction 


The suffix problem has appeared in solutions 
of recurrence systems for parallel and multiproces-— 
sor machines [Gajs81] and in design of gate and 
Silicon compilers [GaBL81], [Kris81], and [{LaFi80]. 
Many operations on a register-transfer level (addi- 
tion, comparison, prioritization etc.) can be sim- 
ply described by Boolean recurrence systems. The 
solution of the suffix problem is the single most 
important part of the more general solution of 
recurrence problems, which can be _ trivially 
extended to the solutions of fixed-length problems 
solved by finite-state transducers [LaFi80]. In 
this paper we give a new algorithm for generating 
area and time efficient parallel solutions for the 
suffix problem. 


The solution of the recurrence system of 
length n and order 1, denoted by R<n,1l>, is 


x, = £.(x. 
i {°"1- 


» 


for all i, l<i<n, and given x,. Furthermore, 


0 


os ie ha es 


ee Aa eg Pia) ag) 
= (£ Of, 10° 2 00f,0f |) (XQ) | 
= £,,(%) 


where the symbol © denotes the composition of func- 
tions. Thus, the solution of every recurrence sys- 
tem can be decomposed into two subproblems: 


This work was supported in part by the National 
Science Foundation under grant No. US NSF MCS80- 
01561. 
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Functional-Composition 


(a) Suffix problem: the computation of the func- 
tional composition f,, = £,0f O«e-Of Of, for 
alit. icin, and” oe os 


(b) Functional evaluation: the 
x, = £ (Xp) for all i, 1l<i<n. 


Subproblem (a) can be solved by a tree-like 
network, consisting of identical nodes, called 
Cells (FCCs). Each FCC 
takes two functions fi and t4 as inputs and gen- 


computation of 


erates their composition es 


Subproblem (b) is solved by n identical nodes, 
called Functional-Evaluation Cells (FECs). Each 
FEC takes a function fi. and its argument Xp as 


inputs and generates £7 (XQ) 


Detailed treatment of recurrences can be found 
in [BiGa8l]. 


More abstractly, the suffix problem can be 
defined for any semigroup <S ,O>3 given 
Ss , €S, compute each of the products 


a enH1? °° 25189 
Ss, Os O.+-90S,0s. for all k, Ox<k<n. 


PR KS k=l 1 "0 

Each suffix-problem solution can be 
represented by a directed acyclic oriented graph. 
Each node of in-degree 2, called a product node, 
represents a product of its two inputs. The input 
nodes have in-degree 0 and are labeled with an ele- 
ment S,ES- The output nodes (the nodes that 


represent the solution of the suffix problem) have 
in-degree 1 and are labeled with an element p,eS- 


Hence, each node in the graph represents either an 
element from. S or a product of some elements from 
S, which is itself an element from S.- Two dif- 
ferent graphs for the suffix problem of length 9 
are shown in Figure 1. The numbers along the right 
side of the graphs denote levels, which correspond 
to time steps. 


We will introduce several complexity measures 
used to characterize any suffix graph G. The size 
of G, s(G), is the number of product nodes in G, 
while the depth of G, d(G), is defined as the max- 
imum number of product nodes on any directed path 


in G. Thus, d(G) equals the maximum level. The 
depth of G is proportional to the time delay 
through G. For the two graphs G, and Gy shown in 


Figure l, s(G,)=17 and s(G,)=15 while the depths of 


G, and G, are d(G , )=d(G,) =4- 


p, es, is the number of product nodes on the path 


The cost Cis of a 


from S. to Pas The cost of G, c(G), is defined as 


Sq S34 "6 0 4 210 $3 §7 2654 9 977-6 


My 


: We iF 


4 : 4 


So 


Pg Pa Pe Ps Py P43 Po Py Po Pg P7P¢ P5 Py, P3 Po Py Po 
(a) (b) 


Figure 1. Two graphs for suffix problem of 
length 9 and depth 4. (a) Gis (b) G,s 
O<i<n. 


max c If the nodes are laid out in a 


i’ 
two-dimensional array such that each node drives 
only nodes in the same row to the left and in the 
same column below, then the silicon area occupied 
by the layout of G is linearly proportional to the 
product nxc(G). In our example, c(G,)=4 and 


c(G,)=3- £(G), 
out-degree of any node in G. 


£(G,)=5- 


The fanout of G, is the maximum 


So, £(G,)=6 and 
The last parameter that characterizes the 
suffix graph G is the initial-value-availability of 


G, e(G), which is the minimum number of levels 


after which Sy can be used as an input to a product 


is -the first element and corresponds to 


node. So 


the initial value of the recurrence system. For 
example, the initial value for parallel adders is 
the value of the input carry. This value may not 
be available at the same time as the rest of input 
values. For the two suffix graphs in Figure l, 
e(G,)=e(G,)=2. In comparison of the two different 


graphs shown in Figure 1, the solution represented 
by Gy requires less area of silicon, less power, 


and has smaller 
represented by Gy 


is smaller then in Gye 


in better performance in the implementation of G 


time delay then the solution 


Furthermore, the fanout in G, 


This will eventually result 


9° 

Many papers considered problems related to the 
suffix problem but never proposed algorithms based 
on suffix-problem solutions. Brent and Kung 
{[Brku79], considering a regular layout for parallel 
binary adders, proposed a suffix solution with 
depth 2log,n and size 2n-2-log,n with fanout f=2. 


Ladner and Fischer [LaFi80] were the first to 
define the suffix problem, although they call it a 
prefix problem. They developed an algorithm for 
constructing suffix solutions with minimum size for 
the given length n of the suffix problem and the 
depth of the solution. However, their algorithm 
works only for depth [log,n] <d< 2[log.n]- The 


solutions for lengths that are not an integer power 
of 2 are not optimal. They did not consider either 
the cost of the suffix-solution layouts or the 
fanouts of their solutions, although they have 
given an upper bound on fanout. 


In this paper we present a different algorithm 
that generates suffix solutions with minimum cost 
and size for a given length n, depth d, initial- 
value-availability e, and fanout f.- Our algorithm 
generates a minimal solution for any integer n. 


Furthermore, the depth range is extended to include 


all depths from [log n] to ne The solution with 


depth n really represents a serial solution with 
size n and cost le In the case of binary adders 
this solution corresponds to a ripple-carry adder. 


The complete algorithm consists of two parts 
and it is presented in the following two sections. 
In Section 2 we present Algorithm 1, which gen- 
erates a minimum-cost solution for any given 
length, depth, initial-value-availability, and 
fanoute In Section 3 we present Algorithm 2, which 
takes the solution generated by Algorithm 1 and 
minimizes its size by using only local optimiza- 
tions. Comments on our approach and description of 
some open problems are given in Section 4. 
Finally, in Section 5 we show that for all practi- 
cal values of n our algorithm, in comparison with 
the Ladner-Fischer algorithm, generates suffix- 
solutions with smaller sizes. 


TG Nanna caer nnetieaniNNte, 


A graph that produces the solution for a suf- 
fix with length ntl, given the constraints d 
(depth), e (level after which So is available), and 


f (fanout) is denoted by G<n,d,e,f>. 
shows G<8,4,2,5>. 


Figure 2 
Figure 2-a has the following 
interpretation. Every vertical line is called a 
column. Each product node is represented by a 
number and is driven from two other nodes. One 
driving node is always above the driven node, in 
the same column, while the other driving node is 
located in some column to the right of the node and 
is called the right-driving node. Each group of 
nodes driven from the same right-driving node is 
connected by a horizontal lines The right-driving 
node is located in the column to the right of the 
group and one level above it, and their connection 
is represented by a diagonal line. The graph can 
be laid out in a number of rows that is equal to 
the cost ce The number that stands for a node 
indicates the number of the layout row that the 
node belongs to. Figure 2-b shows the relative 
locations of the nodes in the layout. Connections 
are not shown, but each node drives only nodes to 
its left in the same row, or below it in the same 
column. Figure 2-c shows a binary matrix represen- 
tation of the graph. The presence of a node is 
represented by 1 and absence of a node by 0- Each 
column can be interpreted as a binary number. Let 
the uppermost row correspond to the least signifi-~ 
cant bit. Then the matrix can be represented by a 
series of numbers, as shown in Figure 2-d. Note 
that the series is strictly increasing from right 
to left. Figure 2 and Figure 1-b are different 
representations of the same graph. 

To motivate the algorithm, let us look at the 
graph e<2*-1,%,0,2° 41> which is generated recur- 
sively in an obvious manner.- This graph may bé 


represented by the series <21, 2%-2, 6.0432, 13 
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Figure 2. Graph G<8,4,2,5>. 
(a) Connections scheme. (b) Layout. 
(c) Binary matrix representation. 
(d) Series representation. 
Figure 3 shows this graph for k=4, namely, 
G<15,4,0,9>. Note that k is the lower bound for 


the depth of the solution for the suffix-problem of 


length ok given e=0 and page A 
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Figure 3. Graph G<15,4,0,9>. 


Three important observations can now be made: 


(a) Column eliminations. 2P_1 columns, O<p<k, 
can be eliminated from the right side of the graph 
as well as any number of columns from the left side 
of the graph. The remaining graph still represents 
a suffix problem solution. For example, eliminat- 
ing columns 1, 2, and 3 from the right side and 
columns 14 and 15 from the left side changes the 
graph G<15,4,0,9> to G<10,4,2,7>. More important, 
certain columns can also be eliminated from the 
middle of the graph, as a result of the following 
Lemma 1. 


Lemmg 1 

In G<n,d,e,f>, column j, m>j>l, can be elin- 
inated if re al The remaining graph is 
G<n-l,d,e,f’>, £°<f. o 
Proof 

OSs Ovne } Og 

Let 24 denote aie ae 8541 3 for any 

1,4; n? j>170. From the construction of 


o<2*-1,%,0,2" 7 41> shown previously, it is obvious 
that any section of two consecutive columns (j+l,j) 
will generally look similar to that shown in Figure 
4. That is, there is no node in colum j+l above 
node A, and if there is a node below node A in 
column j+l, then there is also a node on the same 
leyel in column j, and vice versa. There cannot be 


‘Figure 5-b, 


a node in column j on the level right below node B. 
Also, j>x>y. 


fH 
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Figure 4. General two-column section of 


che graph’ C<2"=1,k 0,2" 415: 

The transformation shown in Figure 5 can now 
be carried out, resulting in the elimination of 
colum j- Figure 5-a shows a portion of three con- 
secutive columns (j+l,j,j-l), where C441 S50 In 


the uppermost node of column j+l is 
disconnected from node U and connected to node V 
(in fact, every node in column k, k>j, driven by 
node U will now be driven by node V).- Column j is 
renamed j° and every column k and input node S.> 


n?k>j, are renamed k-1 and s _)? respectively. The 


k 
lowest node in each renamed column k, n-1l?k?j, pro- 
duces now P,* The lowest node in the new column j 


produces p, as does the lowest node in column j’. 


Column j” no longer contains any right-driving- 
node, so it can be eliminated, resulting in the 
section shown in Figure 5-c. Column n has been 
left untouched, and since c,=l, column 1 cannot be 


eliminated. Therefore d and e remain unchanged. 

The fanout may only be reduced. O 
f#Hl oj i Jd ss j jl 
Sey a4 Say S, sr 


pPityy { l \Siuy 
Ng iz (ie (Riz Pie [PL pie (Sik Sine 
| | | | | | 
a i & A a. 
Pati Py P51 Py PG OP 5-1 3 Pye 
(a) (b) (c) 


Figure 5. Elimination of column j. 
(a) Original section. (b) Change of connection. 
(c) Transformed section. 
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Since the graph portion shown in Figure 5-c 
preserves the properties of the general two-column 
portion shown in Figure 4, the transformation can 
be executed iteratively on the transformed graph. 


In particular, columns having the largest 
number of nodes can be eliminated. For example, 
columns 3, 7, and 11 in G<15,4,0,9> have this pro- 
perty and can be eliminated from the graph. Note 


k-1 


that after applying Lemma 1 on e<2*_1,%,0,2 l> 


ok 4-1 times, the graph is reduced to G<k,k,0,2>. 
(b) Segments. Each column has one FEC as the 
lowest node and zero or more FCCs above it. The 
graph is divided into d-e contiguous segments, 
according to the level where the FEC is located. 
Segment t contains columns that have their FECs on 
level t. In G<15,4,0,9> there are four segments: 
1, 2, 3, and 4 In segment t, the FCCs occupy the 
t-l1 first levels: 


no FCC, (tt) =e-1 columns with one FCC in each and 


in general there are (a, columns with i FCCs. 


there are \=1 columns with 


(c) Fanout. The leftmost FEC in each segment 
serves as a right-driving node for all the FECs of 
the next segment to its left. Therefore, the big- 
gest FEC fanout equals the number of columns in the 
largest segment plus one. For example, the biggest 
fanout in G<15,4,0,9> is 9. Each FCC drives only 
nodes in the segment it belongs to, so the biggest 
FCC fanout is always less then the biggest FEC 
fanout. Therefore, if we do -not allow the fanout 
of the leftmost FEC in each segment to exceed f 
when constructing the graph, 
other node is automatically taken care of. 


The minimum cost solution for G<n,d,e,f> will 
be denoted G <n,d,e,f>. We will now present a sim- 


ple but neither time-efficient nor space-efficient 
algorithm to construct G <a,d,e,f>. A more sophis- 


ticated algorithm that generates the same graph but 
takes linear time and constant space is given in 
the Appendix. 


denotes the number of 
denotes the number of 


At any moment, 1’ 
columns in the graph, n. 


columns in each segment t, nS denotes the column 


with the maximum cost in segment t, and j denotes 
the column with the maximum cost in the graph (if 
more then one column qualifies for oF or j, any of 
them will do). 


Algorithm 1 
1. Construct the graph <2¢-1,4,0,2 
If n’<n, stop. G <n,d,e,f> is not realizable. 


d-liis, 


Else continue to the next step. 

2. If e>0, eliminate columns | through 2°~1, 
If n’<n, stop. G <n,d,e,f> is not realizable. 
Else continue to the next step. 


3. For each segment t do | 
while n,>f-1 do eliminate column de 


If n’<n, stop. G <n,d,e,f> is not realizable. 


the fanout of every: 


If n’=n, stop. The current graph is G <n, d,e,f>- 
Else continue to the next step. 


4. While n’>n do eliminate column j. Oo 


Example: construct G <8,4,2,5>- 


1. Construct G<15,4,0,9> (see Figure 3). 
2. Eliminate columns 1 through 3. 


3. Eliminate columns 15, 14, 13, and 11 
in segment 4. 


We are now left with 8 columns. Therefore the 
graph is G<854,2,5>, as shown in Figure 2. 0 


As d approaches n, the algorithm becomes more 
inefficient in terms of space and time. However, 
if we compute in advance the maximum column cost 
and how many such columns are in each segment, then 
the graph can be constructed column by column. An 
efficient algorithm that exploits this idea has 
been devised, so that it takes time proportional to 
ne The details are given in the Appendix. 


For the graph G<a,d,e,ntl>, a simple formula 


that implicitly gives the cost can be derived, 
using elementary combinatorial considerations. The 
cost is the smallest c that satisfies Inequality 1: 


Cc 
ste > 7 
i=l 


Theorem 1 


The cost of G <2" kt2, kt, 2“+1> is [k/2]+1. O 


Proof 


According to Algorithm 1, we execute the fol- 
lowing steps: 


Z 1 


~1,k4+2,0,2%tt41>. 


2. Eliminate columns 1 through gett ae, 


3. This step is not executed, since ea2kt hy, 


4. The graph has now geet columns, from which ok 


columns have to be eliminated (in fact, the 
KA) 142 dl 41s). “the 
k+1 


number of columns having the cost itl is ; 


- "p (e# _jktl og (etl) _f ket | 
a i i)” \e-i+1}? 
O<i<kt+l, after the elimination of ok columns the 


1=0 
maximum cost of the remaining columns is k/2+1 
for an even value of k or (k+l)/2+1 for an odd 


1. Construct G<2kt 


current graph is G <2 


O<i<k+l. 


k value. oO 
Theorem 1 implies the following important 
consequence. Obviously the cost of 


G <2" kt, k,2*+1> (the graph with the minimum depth 
for n=2* , e=k, and f=2*41) is k+l. Then by allow- 
ing the depth d (as well as e) to be greater by 
just one, the cost is reduced to about half (refer 
also to Table l-a, Section 5). 


Figure 6 shows two graphs generated by Algo- 
rithm 1. 
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Figure 6. Minimum—cost graphs. 
(a) G <16,5,2,8>. (b) G <24,5,2,13>. 
Cc Cc 


This algorithm operates on a graph G <n,d,e,f> 


generated by Algorithm 1 and generates a minimun- 
size graph, denoted as G<n,d,e,f>. The algorithm 


performs local optimization on G <n,d,e, f> by try- 


ing to reduce the number of nodes wherever possi- 
ble. Figure 7 illustrates how local optimization 
is accomplished. Figure 7-a shows a section of the 
graph G <64,7,6,65> (columns 24 to 32). The nodes 


surrounded by the dashed box are deleted, and new 
nodes, surrounded by a full box, are introduced as 
shown in Figure 7-b. This transformation, called 
the local-optimization step, reduces the number of 
nodes in the graph while preserving the suffix 
solution P.» 0<k<64. This can be shown by reason- 


ing similar to that utilized in Lemma 1. A local- 
optimization step can be applied once more on the 
nodes surrounded by the dashed box in Figure 7-b. 
The final result is shown in Figure 7-c. 


Note that right-driving-nodes may not be 
deleted, because of their connections to the nodes 
driven by them. Thus, the nodes are classified 
into two groups: (a) fixed nodes, which are the 
right-driving nodes, and (b) non-fixed nodes, which 
are all the other nodes. Fixed nodes are circled 
in figure 7. 


illustrates the general  local- 
The nodes shown in Figure 8a, 
are non-fixed. There may be 
qi but if there is a 
node on level v, Vasvsy,, in column j, k?j2i, then 
there must be a node on level v in each colum Vs 
and these nodes are non-fixed, also. There are no 


Figure 8 
optimization step. 
except for node F, 
nodes on levels vi tt to v 


nodes on level Vvqtl in this section. Figure 8-—b 
shows the section after the local-optimization 


249 


Prrrd? tus 9) Ea ds Pe ad O 
ft tet ret rst ts ty! rst rst rst ty 
2!@!3:!:@:3 1 2!:@:3:@!383 1 
rors to obousb tt borst to torsbor} 
3-3 ! 1!@4! ! 4- 2 saal!.:'@4!! 4- 2 
ror} ie borot lyy foe bot} 
4-4-4-4! !!!:@ 38 444-4-4 ! 1 1! 3 
poyorp bpp yyy pPrrrerg sy 
55-5-5-5-5-5-5l! 4 B-t-f-t-'-t-'-! 4 4 
riobobobr gt ye prprtdztrpergd 
6-6-6-6-6-G6-6'6- 5 i 5 
Ors:8-8-8-6- 5-6 & rir? © 
Pret?r?tprtsp 6 | [(6-6-6-6-6-6-G ! 6 
Prrtrrprprdgs a a 
-7-7-7-7-7-7+7-7-7- 7 -7-7-7-7-7-7-7-7-7- 7 
prorrobrrpepoous propre? egy 
(a) (b) 
ae aes ea ee a ee 8. 
YE Yt t/t yt ty 
2!:@!:3:@!3 1 
bast torstor 
3-! 1 !@4!! 4- 2 
bor Pt tou 
4-!1@4!:3:! 3 
! iP brorisyst 
S-(qj-'-!-'-'-'-! 1 4 
! a prerrgra 
Patal-!- et! 5 
ie oe 2 
! 6-6-6-6-6-6-6 ! 6 
Pyorerteprrpes 
-7-7-7-7-7-7-7-7-7- 7 
Prrdtrdgsprdgs 
(c) 


Figure 7. Local-optimization of columns 24 to 32 
of G <64,7,6,65>- (a) Original section. 


(b) Section after first local-optimization step. 
(c) Section after second local-optimization step. 


d in columns 


i+l through k are deleted and new non-fixed nodes 
are introduced on level v,tl in these columns. 


d 
Node A is marked as fixed, since it now drives the 
group of new nodes. 


The effect of the transforma- 
tion is the reduction of the size of the graph. We 
are now ready to state the algorithm. 


step. The nodes on levels ve through v 


k 
t 
| 
| 


A 
| | | ma 
V +1 
| | | | | i d 
bs ¢ § ¢ 2b 4 
(a) 


Figure 8. General local-optimization step. 
(a) Section before step. (b) Section after step. 


Algorithm 2 


1. Mark fixed nodes. 
2e Scan the graph and perform the _ Ilocal- 
optimization step wherever possible. Oo 


Notes: (a) The cost of the graph is not changed. 
(b) £ is the upper bound for G<n,d,e,f>. 


There may or may not be any node with fanout 
equal to f. 


Step 2 of this algorithm is somewhat : vague, 
since the local-optimization step can replace any 
number of rows of non-fixed nodes with one row of 
new non-fixed nodes. However, if we restrict our- 
selves to replacing only two rows at each local- 
optimization step, we can limit the scanning to 
each individual column independently, as follows. 
The column is scanned from bottom to top. If a 
triple of levels {voVEoV os vaqv, tl, VV, can be 
found such that non-fixed nodes are located on lev- 


els v, and v_ and the levels between v, and v_ as 
b c b c 


well as level Ve are empty, then these two nodes 


are deleted and a new non-fixed node is introduced 
on level Me The net effect of this. local- 


optimization step is the reduction of the number of 
nodes by one. Local-optimization steps are carried 
out iteratively on the colum until no_- such 
sequence is found. Figure 9 demonstrates’ the 
local-optimization done on column 31 of 
G <64,7,6,65> (compare it to Figure 7). 


ne 
1 
19! Yb 
{ N Ne — . —> 
{ 
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Figure 9. Local-optimization on a graph column. 


Figure 10 shows the graphs of Figure 6 after 
applying local-optimization on each column. 
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Figure 10. Minimum—size graphs. 
(a) G,<16,5,2,8>. (b) G5<24,5,2,13>6 


4. Comments and Open Problems 


The local-optimization step can replace any 
number of rows of non-fixed nodes with one row of 
non-fixed nodes. By choosing a different number of 
rows to be replaced in each local-optimization step 
we may end up with graphs of different sizes. Gen- 
erally, the size reduction is greater as the cost 
of the graph grows. It is an open problem on how 
to choose the right number of rows to be replaced 
in each section during a local-optimization step, 
in order to achieve a smaller-sized graph. Also, 
Algorithm 2 operates on the graph generated by 
Algorithm 1, which eliminates columns according to 
their cost. If another strategy for eliminating 
columns is used by Algorithm 1, it may result in a 
smaller-sized graph generated by Algorithm 2. 
Another open problem is finding the expressions for 
the size of G <n,d,e,f> and G<n,d,e,f>- 


Figure ll-a shows a structure of a 16-bit 
adder generated by a silicon compiler using Algo- © 


rithm 1. Each square is a cell that represents a 
node. Rows 2, 3, and 4 from the top form 
G <16,5,2,8> (compare it to Figure 6-a). In Figure 


ll-b, Algorithm 2 was used, so these three rows 
form G<16,5,2,8> (compare it to Figure 10-a). 4 


cells have been saved, but additional connecting 
lines which enter cells A, B, C, and D are intro- 
duced. For details of the cell functions and lay- 
outs see [GaBL81]. 
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Figure 11. Structure of a 16-bit adder. 
(a) Using G <16,5,2,8>. (b) Using G<16,5,2,8>. 


Algorithm 1 accepts the depth d as a parameter 
and generates the minimum-cost graph. A similar 
algorithm can be easily devised that accepts the 
cost c aS a parameter and generates the minimum 


depth graph, denoted as Gj<n,c,e,f>. Gj<n,c,e,£> 


is realizable for every n, c, e, and f. For the 


graph G,<a,c,e,ntl>, the depth is the smallest d 


that satisfies Inequality 1. 
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5. Results and Comparison 


In order to be able to compare our algorithms 
with the Ladner-Fischer algorithm, we will restrict 
ourselves to the_- graphs G<n,d,d-1,nt1> and 


G<n,d,d-1,ntl>, where n is an integer power of 2. 
In these graphs, all the FECs occupy level d. 
Table l-a shows the cost of G <n,d,d-I,ntl>. 


The cost approaches 2 as d approaches n, for n>2. 
Table 1-b shows the sizes of these graphs. Table 
l-c shows the sizes of these graphs after applying 
local-optimization on each column independently, as 
described at the end of Section 3. Table 1-d shows 
the sizes of the graphs generated by the Ladner- 
Fischer algorithm. This algorithm covers only the 
range log, ntl <d< 2log ntl, whereas our algorithm 
covers the full range log ntl <d<¢n.- For all 
practical cases, G<n,d,d-1,ntl> has a smaller size 


then the graph generated by the Ladner-Fischer 
algorithm and sometimes even G <a,d,d-1,n+l> has a 


smaller size (especially when d approaches n). 

n\d 1 2 3 4 5 .6 7 8 9 10 11 12 13 #14 #15 16 17 
] 1 = = a = = = od = = = - = - - = - 
2}o- 2 - - - = = = = = = = = -~ = + - 
4 - - 3 2 7 7 - Ta = - = - - - - - ~ 
ej - = = 4 3 3 3 2 = = = ie ~ = = = 

16 - = - - 5 3 3 3 3 3 3 3 3 3 3 2 - 
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64), - - =~ ~ = = 256 218 209 199 188 179 178 177 176 175 174 
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Table 1. Cost and size of suffix-graphs. 
(a) Cost of G <n,d,d-1,ntl>. 


(b) Size of G <n,d,d-l,ntl>. 
(c) Size of Gj<n,d,d-1,ntl>.- 


(d) Size of graphs generated by 
the Ladner-Fischer algorithm. 
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Appendix 


We present here an efficient algorithm to con- 
struct G<a,d,e,f>. First, we will introduce nota- 


tions for the algorithm. The word "current" means 
"at the moment of execution", as opposed to 
"final", which means "after completion of the algo- 
rithm". The parameters n, d, e, and f have been 
defined in Section 1. Some notations have slightly 
different meaning in each part of the algorithm. 
References to the algorithm parts are made where 
confusion might be caused. 


c - Cost of current graph (part a); 
Cost of final graph (part b). 

t - Segment number. 

e.* Cost of current segment t (equals the 
maximum cost of a column that belongs 
to segment t) (part a); 

Cost of final segment t (part b).- 

ce ~ Maximum cost of column to be added to 
segment t. 

q - Maximum number of columns each having cost c 
that current segment t may have if the fanout 
is unlimited. 

An - Number of columns still to be added to 
current graph. 

Af - Number of columns still to be added to 
current segment t. 

n. - Number of columns in current segment t. 

m -~- Number of columns with cost Cy in current 
segment t (part a); 

Number of columns with cost c. in final 
Segment t (part b). 
ia Status of current segment t: 


open: n 


t 
current segment t); 


<f-1 (columns may be added to 


Closed: n =f-1 (segment t is full or 


maximum fanout has been reached). 
f° - Current. fanout of the leftmost FEC in 
segment t-l. 
j  - Column number. . 
b - Binary number representing column j. 
u(b) - Position of the least significant 1 in b 
(l.seb- is in position 0). 
w(b) - Position of the most significant 1 in b 
(1l.-seb. is in position 0). 
z(b) - Cost of column represented by b 
(equals number of 1’s in b). 


Algorithm 1’ 


Part a: Evaluation of c, and ma, for each 


c 
segment t in the graph. 


{ initialization } 
for t « etl to d do 


begin 
n <1 
s. ¢ open 
end 
An <n 
ec «0 


{ main loop } 
repeat 
ec ¢ ctl 
t<¢e 
repeat 
t <« ttl 
if s,=open then 


begin 
(1) 


if q=0 then a. closed 
else 
begin 

Af < f-n, 


ec, €c 
t 


Lf q<Af and q<An then 
begin 


end 
else if Af<q<An or Af<An<q then 
begin 
s_ ¢ closed 
m. + Af 
An <« An-Af 
end 
else { An<q<Af or An<Af<q } 
begin 
ss « An 
An < 0 
end © 
end 
end 
until An=0 or t=d. 
until An=0 or An has not been changed 


Part b: Graph generation. 


t<e 
f° ef 
for j « 1 ton do 
begin 
if f’=f then 
begin 
t « ttl 
bee 
generate column j 
f° «2 
ce’ <ec¢ 
sua. * t 
else 
begin . u(b) 
if z(b)=c then b <« bt2 else b ¢ btl 
if w(b)>t-1 then 
begin 
t < ttl | 
oe ie 
generate column j 
f° «2 
c’ ¢ec 
wat t 
else 
begin 
generate column j 
£° « £°+1 
if z(b)=c’ then 
ce 
begin 
mi. < m-l 
if m=0 then 
= a € p-2" () 
0% + eel 
goa * t 
end 
end 
end 
end 


The statements "generate column j" in part b 
of the algorithm mean "use the binary number b as 
column j in the binary matrix representation of 
G<n,d,e,f>". We have shown how to perform local 


optimization on each column independently, so if 
"generate column j' is replaced by "generate and 
local-optimize column j", the algorithm will pro- 
duce Gj<n,d,e,f>. 
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PIN LIMITATIONS AND VLSI INTERCONNECTION NETWORKS* 


Mark A. Franklin and Donald F. Wann 
Department of Electrical Engineering 
Washington University 
St. Louis, Missouri 63130 


Abstract: Multiple processor interconnection net- 
works can be characterized as having N' inputs and 
N' outputs, each B' bits wide. Construction of 
large networks requires partitioning of the N'*N'xp' 
network into a collection of N*N switch modules of 
data size B (B < B') each implemented on a single 
chip and interconnecting them with a specific 
interchip network type T'. The major constraint 
in the VLSI environment is the pin limitation, N 
of the individual modules; these are allocated 
between data and control lines, Q. This paper 
presents a methodology for selecting the optimum 
values for N and B given values of the parameters, 
N', B', T', Q, and N. Models for both the banyan 
and crossbar networks are developed and arrange- 
ments yielding minimum number of chips, average 
delay through the network, and product of number of 
chips and delay, are presented. A bit slice 
approach (B = 1) produces the optimum arrangement 
for the crossbar, while for the banyan the optimum 
is achieved with multiple bits per module. 


Introduction 


Over the past few years a variety of physically 
local, closely coupled multiple processor systems 
have been proposed (1,2,3,4). One key issue in 
the design of such systems concerns the communica- 
tions network used by these multiprocessor systems. 
Various studies have focused on the functional 
properties of such networks (i.e., what permuta- 
tations are possible, what control algorithms are 
needed, etc.), on their complexity, and to some 
extent on performance issues (5, 6, 7, 8, 9). 

In most cases network complexity has been measured 
by the number of elementary switching components 
needed by a network of a given size and type, while 
performance has been determined by the average 
number of elementary switching components through 
which a message must pass (i.e. average delay). 
Recently work has begun on examining complexity and 
performance questions in the context of VLSI imple- 
mentation of such interconnection networks. 
Franklin (10) has compared two networks,crossbar and 
banyan, operating in a circuit switched mode in 
terms of their space (area) and time (delay) 
requirements. The networks were assumed to be 
implemented as complete modules on a single VLSI 
chip. 

Closer examination of VLSI network implementa- 
tion problems shows that pin limitations, rather 
than chip area or logical component limitations, 
are a major constraint in designing very large | 
interconnection networks. Consider, for instance, 
a network with N' inputs, M' outputs and with each 
output being B' bits wide (N'*M'*B'). The number 


*This work was supported in part by NSF Grant MCS- . 
78-20731 and ONR Contract NO014-80-C-0761. 
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of required pin connections (ignoring power, 
ground and general control) for a single chip 
implementation is given by B'(N'+M'). For a 
square network of size twelve with B'=16, the num- 
ber of pins required would thus be 384; much 
larger than common commercially available integra- 
ted circuit carriers. Given that pins are 
typically placed on 100 mil centers along the 
periphery of the package, the total number of pins 
is limited mainly by the increase in the physical 
length of the package. For this pin placement 

and the 384 pin example, a 19.2 inch dual-in-line 
package would be required. 

In this paper we focus on two of the more 
obvious solutions to this pin limitation problem. 
The first approach is to implement a large network 
(N'*N') requiring many pins as a interconnected 
set of smaller subnetworks (N*N) where each of 
the smaller networks can be contained on a single 
chip in which the chip pin constraints are met. 

The second approach is to slice the network 
so that one creates a set of network planes, each 
plane handling one or more bits (e.g., B bits) of 
the B' wide datapath. This is commonly done in 
memory designs. A potential problem arises in 
this approach due to the difficulty in synchroni- 
zing the multiple planes. This is discussed in 
reference ll. 

The remainder of this paper deals with deter- 
mining the "best" combination of datapath slice B 
and chip network size N given: 


1. N': An overall network size (N <= N'), 


2. Bt: A data path width (B <= B'), 

An intrachip network type (e.g., 
the interconnection network imple- 
mented within the N*N chip might be 
a crossbar). 
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An interchip network type (e.g., 
the interconnection network imple- 
mented between the N*N chips to 
achieve the overall N'*N' network 
might be an Omega network). 


The maximum number of pins allowed 
on a chip. 


The number of pins on a chip 
allocated to power, ground, and 
control. 


"Best" in this context, refers to both chip count 
and bandwidth of the overall N'*N' network. 
Figures 1, 2 and 3 illustrate a general N'*N' 
network, and a possible decomposition of a sample 
16*16 network. In the next section basic models 


for this problem are presented and used to deter- 
mine the B and N combinations which minimize the 
total chip count, the overall network delay and an 
overall performance measure using the product of 
chip count and time delay. 


The Basic Model 


The basic model consists of two parts. The 
first relates to the chip count while the second 
concerns network time delay. For brevity, only 
square fully connected networks (i.e. there is a 
path from each input port to each output port) are 
considered. Note that certain input/output paths 
may have a common subpath and this may result in 
messages being temporarily blocked. 

Let us refer to the N*N*B chip as a switch 
module; a number of these modules will be inter- 
connected to realize the N' network. This paper 
considers two types of interchip networks (T'): 
the incremental crossbar, CB, and the banyan BA 
(12,13,14). While there are many ways of design- 
ing a crossbar network (e.g., demultiplexer/multi- 
plexer configuration, switched multiple busses, 
etc), the incremental crossbar design (Figure 4) 
can be expanded on a unit basis by adding basic 
switch modules in a row-column arrangement. This 
modularity property permits flexible expansion 
while retaining the nonblocking and full connection 
properties of the crossbar. A price is paid for 
these properties in terms of number of switches 
and pins required on a switch module. While the 
number of switches required per switch module may 
not be a serjous constraint with VLSI technology, 
the problem of pin constraints is severe. For the 
incremental crossbar, the modularity property 
requires 4NB data pins to implement a N*N*B switch 
module while the banyan, a blocking network, 
requires 2NB data pins. 

To make global comparisons similar and to 
eliminate blocking at the switch module level, this 
paper examinesscases in which the switch modules 
are constructed using an incremental crossbar arch- 
itecture (T = CB). Two types of module inter- 
connections are examined; the crossbar and the 
banyan (T' = CB or T' = BA). 


Chip Count Model 


As illustrated in Figure 4, the number of 
N*N*B chips required to implement an N'*N'*B' 
incremental crossbar network is given by: 


pale 


The banyan network is one of the class of 
blocking networks whose logical component complex— 
ity grows as O(N log N) rather than 0(N**2). As 
illustrated in Figure 5, the number of N*N*B chips 
needed to implement as N'*N'*B' banyan network is 


given by: 
na = f75] for] fice 


The first term in this expression is the number of 
bit slices or network planes that are required. 
The second term represents the number of chips at 
each level (row), while the third term is the nun- 
ber of levels. 


N 


me [1] 


[2] 


Time Delay Model 


A model giving the average time for a signal | 


_to propagate through a network must include the 
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time to traverse each of the chips, the time to 
propagate from chip to chip, and since the bit 
slice approach separates the bits in a data word, 
the additional time that is needed to make certain 
that all the data bits have completed their move- 
ment through the N'*N'*B' network. 

The average delay associated with a basic 
switch module will be designated as D since _ 
these modules have a crossbar construétion. Path 
setup delays (i.e., time to set switches in their 
desired positions) are not considered here. The 
delay of a pin driver and associated interconnec- 
tion wires between modules (i.e. the intermodule 
delay) is denoted by D,_. The intermodule delays 
for the CB and BA networks are different and will 
be denoted as D, and D. , _. Additional syn- 

ete : mcb, imba |, ; 
chronization delay introduced by the designer to 
assure that all data bits have traversed the net- 
work will be represented by D nd D ‘ 

For the CB network the average delay” Bea be 
determined by examining Figure 4. Note that this 
represents one of [B'/B} planes. Assume that each 
switch module, implemented on a single chip, 
represents an N*N CB network. The pin drivers for 
each module are also located on the chip. For 
this arrangement the number of modules in an 
average path is Int /N] and each intermodule path 
has the same delay D. Therefore the average 
network delay D's» is given by: 


t t , 
[x a] Day + [er] Dancy * 
Note that a circuit switched design is assumed 
here with no pipelining between modules. 

For the BA network the number of switch 
modules and the number of intermodule connections 
is log,_N'. Here, because of the connection topo- 
logy, Nhe intermodule paths are not constant in 
length. The average delay, Dd". » through such a 
network (assuming no delay penalty for blocking) 
is given by: [4] 


' eS ' ' 
D'ba [Loa [Pen 7 Los" | Dimba + Deynba 


D'. «= D [3] 
syncb 


Pin Constraints 


For a square N*N*B chip with N, pins alloca- 
ted to power, ground and control, the pin con- 
straint is given by: 


Np >= KBN + Ny. 
where K = 4 for the CB network and K = 2 for the 
BA network. The equality will be used since it is 
advantageous to utilize as many available pins as 
possible. Two cases may be considered. Case 1 is 
the situation where the number of data pins is 
much larger than N, (i.e., KBN >> N.) and thus 
equation 5 becomes: 


N = |N /KB 
= 
This is typical of a clocked system where a small 


number of clock lines are needed to synchronize 
all the data lines. 


[5] 


[6] 


Case 2 encompasses the situation where N, is 
not neglible and there is a control line overhead 
associated with the data paths.» Assuming that the 
number of control lines is proportional to the num- 
ber of ports, N, on an individual chip (i.e. N,=QN 
where Q is a constant), N can be expressed as: 


N = [s/ ua [7] 


This would be the appropriate model if network 
chips communicated with each other in an asyn- 
chronous manner and the control line overhead 

consisted of request/acknowledge pairs (Q = 2). 


Chip Count Minimization 


For large networks with large datapath widths 
and chips with a large number of pins the ceiling 
and floor functions can be removed from [1] and 


[2], and [6] and [7]. Then Nand N 
can be approximated as continuous funcfions.Assume 


that all available pins are used and consider Case © 


1 where N is given by equation 6. Substituting eq- 
uation 6 with K = 4 and K = 2 respectively into | 
continuous versions of equations 1 and 2 yields: 


ss T(NKKD KK? = 
Nopl 16BB' (N 2) /N, 2 K 43 [8] 
N _ _2B'N'log N' | a a 
bal eet Ae 
o" _N, logN, -10g2B) TogN, - 1og2B 19] 


For a given pin constraint N_, and overall network 
requirements N' and B', K nd K._ are constants. 
Minimizing N bl and Ny for this ease requires 
that B be minimized. fhe smallest datapath width 
possible is B = 1, hence with this model N should 
be selected to be N/K. This result corresponds 
to memory chip design where the slice width is gen- 
erally taken as one bit. Note however, that this 
was obtained with a continuous approximation to eq- 
uations 1 and 2; while B = 1 yields a minimum num 
ber of chips in most cases, there are situations 
where other values of B are better. For instance, 
with a BA network with N, = 60, N' = 128 and B'=16, 
a B = 1 solution yields k = 160, while a B = 2 
solution yields N. = 14hat 
For case 2 where N, is not negligible equation 
7 is used for N and substituted back into the con- 
tinuous versions of [1] and [2] to give: 


2 
N = (4B+0)7B'N'~ Be K vp 4B+Q) 
Py 16B 
a (2B+Q) 
Nua2 ~ “ba 111] 


2B (logy - log (2B+Q)) 


The derivatives of N ., and Nia2 with respect to B 
can now be taken, and the values of B and N which 
minimize the chip count obtained. 

For the case of T' a CB, the number of chips 
N b is minimized when B = Q/4. Thus for a request 
/eeknowledae pair associated with each chip data- 
path (Q = 2), B would be selected as 1. While this 
is true for almost all cases considered, the conti~ 
nuous model approximation should be checked when N' 
is less than 64 or B' is less than 16 (e.g., For N' 
= 32 N. = 75, Q = 2 and B' = 16; B = 1 yields N ob? 
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= 144: B = 2 yields Nb? = 128). 

For the case of ~°°T' a BA network, an equa- 
tion can be derived for obtaining the optimum B 
and N and indicates that the continuous model does 
not yield optimum values in many situations. 

For instance, for N. = 90, N' = 512, Q = 2 and B'= 
16, a search procedire working directly with equa- 
tion 2 gives an optimum B = 4 and yields N = 
684. Note that using B = 1 in this case results 
in N = 1152. This is not unusual, and in most 
caseR*{n < 140) where Q > 2, a choice of B = 1 
will be nonoptimal. 

Equations 1 and 2 were solved using optimal 
values of N and B. and the chip count was obtairmd 
as a function of the parameters N , N' and network 
type T', Figure 6 i11strates how the 
total number of chips varies as a function of the 
network size. Plots for two different values of 
Nand Q are also given. For a given N', N. andQ_. 
the BA requires fewer chips than the CB impfementa- 
tion and the curves agree with the observation tha 
the crossbar grows as 0(N'**2) while the banyan 
grows as O(N'LogN'). As expected, increasing Q or 
N requires a larger number of chips for both the 
banyan and the crossbar. Although not shown explt 
citly in these graphs, the optimum value of B is l 
for the crossbar (N_ > 64), while for the banyan 
the optimum B ranged from 1 to 4 Oy, > 64). 


Network Delay Minimization 


Next we determine expressions for the delays, 

D.,, D,., and D and incorporate these into equa 
eb’ 4m s 

tions and 4 to’ compute the average delay through 
the two networks, 


Crossbar Network 


The value of Dt has been developed by Frank- 
lin (10) using NMOS Ror gates for construction of 
the crossbar module and is given as: 


D. = N[2.5mfr + T(1+2.250 


b 
The parameters are defined in Table 1 which also 
gives some typical values. The equation assumes a 
circuit switched CB, and uniformly distributed ad- 
dressing of module output ports. The first termin 
the brackets represents the delay through an indiv 
ual switch within a module, while the second term 
is the delay between switches in a module. 

The delay encountered when a signal goes off 
the chip, propagates along an interconnecting path 
and enters another switch module is D, . A buf- 
fer (e.g., a series of inverters) must be included 
within the switch module to allow the minimum size 
transistor to drive the module pin and associated 
load with minimum delay. The buffer delay is det- 
ermined by the gate capacitance of the minimum siz 
transistor, the number of stages in the buffer, tle 
capacitance of the pin being driven, theapacitance. 
along the interconnecting path, and the capacitane 
of the receiving module pin. This delay is minimt 
zed when exponentially sized cascaded inverters ar 
used (14). The delay in this case is: 


ere 2 me TOe Pais . [13] 


where B eb is the ratio of the buffer load capaci- 


tance to the buffer input transistor gate @pacitance. 
The transistor gate capacitance, C_, is the capaci- 
tance per unit area times the gate”’area of the min- 
imum transistor. To determine the load capacitance 
assume that the driving and receiving pin capaci- 
tances are equal and each has a value of C_, . Fur 
ther postulate that the modules for the cBPwill be 
placed on a circuit board and interconnected via 
printed circuit copper paths. Given the planar to- 
pology of the CB, the spacing between modules will 
generally be less than one inch. Pin capacitance 
ear ies a in this case and By FC rn ati? ie, 
Pie Synchronization delay depends upon the spe 
cific design technique used to determine that all 
bits have traversed the network. Assuming selft‘imed 
design strategy, a reasonable design practice is to 
include a tolerance or guard region that is propor- 
tional to the average delay time. The average dely 
can thus be expressed as: 


Deb 7 Ki [xr7v ] (D., og D med > 4] 


where K, = 1+K_, andD, and D 5 are given in 
[12] and [13]. flumerical Studied tre shown that 
for the CB with Q = 0 the continuous form of [14] 
usually gives the same results as the discrete form 


Therefore we shall replace [N'/N]N by N'. Finally, 
using equation 7 with K = 4 gives: 
1 ea ' ‘ 
Dob has [A, + Dimeb oe /Np] [15] 


To minimize D',., 4B + Q should be minimized. With 
Q = 0, this means that D', is minimized when B =1. 
Notice that D' is diredtly proportional to N', amd 
decreases to a minimum value as the number of pins 
N increases. Consider next the typical parameter | 
values given in Table 1. For N_ large (i.e. >= 64), 
Q = 0 and B = 1, the average dePay can be approxima 
ted as: 


t « ’ | 
D' ob 6.2N' nsec [16] 


Banyan Network 


The average delay through the BA network is giv- 
en by equation 4. The value of D, is known from 
[12] and we assign a value to D Chhat is propor— 
tional to the average path delay through the. net- 
work. The only remaining quantity to determine is 
the value for D, ba’ The development follows that 
presented for the CB. ‘In this case however the sep- 
aration between switch modules in the BA is not cor 
stant, and C will vary according to the banyan 
level. Sinck Fhe number of levels required for a 
specific configuration is not known a priori, the 
inclusion of a variable for C complicates the 
delay computation. The last Pavel has the longest 
path (8s inches) and therefore the maximum capaci- 
tance. The capacitance of a typical printed circuit 
path is approximately 1 pf/inch thus the delay in 
driving this longest path is: 


Ds aba = Te log, (QC, . ,#8)/C,) [17] 


By decreasing the pin driver area as the banyan lev- 
el decreases this value applies to all levels. 
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N_,N' and network type T’. 


The average delay through the banyan network can nw 
be expressed as: : [18] 


' = t 
Dina K, [2og,.8"] INA, + te log, ((2¢ , +8)/C,)1 


The continuous version of this equation is a poor ap- 
proximation to the discrete version, thus only the 
discrete will be used. Using the values from Table 
1 the banyan delay can be expressed as: 


BA” 6.17 [togyn" | oN + 1.78) — [19] 

The discrete relations for the CB and the BA 
delays were solved using optimal values of N and B, 
and the delays obtained as a function of parameters 
The banyan delay is cor 
sistently smaller than the crossbar for networks of 
reasonable size. 


D' 


Chip Count-Time Product Minimization 


The chip count-time product P, can be obtained 
by multiplying the appropriate equations given pre- 
viously. Earlier discussion indicated that for rea 
sonable size networks, both chip count and delay 
were minimized in the CB case with B= 1. Conseqier 
tly the product is also minimized with this choice 
(for N' > 64, B' > 16). | | 

For the BA, the situation is more complex anda 
computer search for the optimum B and N values must 
undertaken. Consider the case of Q = 0 and N' =512 
Table 2 shows the values of N,B which optimize the 
number of chips, the delay, and chip count-time pro 
duct. The B and N values required for minimizing P 
fall between those needed for minimization of the 


_ chip count and delay measures by themselves. The 


count minimization is achieved by attempting to 
place as large a network as possible on a given ch®. 
Delay minimization is achieved by balancing the de- 
lays associated with the module network and the de- 
lays associated with increasing the number of level 
in the overall network. In this case placing as 
large a network on a chip as possible is not the 
best strategy from a delay point of view. Note that 
this analysis does not consider delays associated 
with network blocking which can have a significant 
effect in a saturated network. 

Values for N and B which minimized P were ob- 
tained for both network types over a range of N',N 
and Q values (Figure 7). As expected P increases 
with increasing N' and increasing Q, and decreases 
with increasing N., Once again the banyan does bet 
ter than the crosbbar on this overall performance 
measure. 


Summary and Conclusions 


This paper concerned the design of multiple pre 
cessor interconnection networks. Models for both 
the banyan and crossbar networks (T') were developed 
and arrangements yielding minimum: number of chips, 
average delay through the network, and product of 
number of chips and delay, were presented. The re- 
sults show that for the crossbar a bit slice apfroah 
(B = 1) produces the optimum arrangement, while for 
the banyan the optimum is achieved with multiple bits 
per module. The impact of the number of control 


lines on chip count, delay and product were also 
modelled. 

The analysis presented made a number of assunp-— 
tions whose effects are being further investigated. 
In particular the role of blocking in the banyan 
case, the potential gain which would accrue from a 
pipelined design, and the problem of synchronizaion 
between network planes is being studied. 


Parameter Symbol Units Typical 
Value 
minimum feature size ; 
min 5 um. 3 
minimum gate area Aint (um) 36, 
gate capacitance a pf 1,4*10 
switch module pin cap. c ,® pf 5 
pin 
transit time c nsec 0.5 
NOR gate logic levels 
per crossgate m -— 2 
NOR gate fanout f -- 2 
Metal path cap. to 
transistor gate cap. 
ratio (switch module) Oop -- 0.1 
guard region K -- 0.1 
multiplier . 
printed circuit Craeh pf lpf/inch 
path cap. 
length of longest BA S inches 12 
path 


TABLE 1: TIME DELAY PARAMETERS 


N B CHIP DELAY CDP, 


COUNT (nsed (*107 ) 


N = 60 
P COUNT MINIMIZATION 30 1 576 392 226 
DELAY MINIMIZATION 5 6 1236 168 207 
PRODUCT MINIMIZATION 10 3 936 218 204 

N. = 90 
P COUNT MINIMIZATION 45 1 384 578 222 
DELAY MINIMIZATION 5 8 824 168 138 
PRODUCT MINIMIZATION 11 4 564 237 133 


N = 120 
P COUNT MINIMIZATION 60 1 288 763 220 
DELAY MINIMIZATION 5 11 824 168 138 
PRODUCT MINIMIZATION 10 6 468 218 102 


TABLE 2: BANYAN NETWORK MINIMIZATION RESULTS 
(N' = 512, Q = 0, B' = 15) 
(CDP: Count Delay Product) 
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LINEAR RECURRENCE SYSTEMS FOR VLSI: 
THE CONFIGURABLE, HIGHLY PARALLEL APPROACH. 
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SUMMARY 


1. CHiP Overview. 


Recently, much VLSI based parallel processing 
research has focused on algorithmically specialized 
processors, highly parallel computers with fixed, 
etched-in-silicon interconnection structures tailored 


to a particular algorithm or small class of algorithms - 


[4,7,11] and many others. While these processors are 
highly successful at exploiting locality, the inflexibility 
of the rigid interconnection structure is a liability for 
problems outside their area of specialization. 


Among the more notable attempts to study a sili- 
con based general purpose parallel computing system 
is the Cube Connected Cycle work of Preparata and 
Vuillemin [8] and designs that may be based on the 
optimal area embeddings of the shuffle-exchange 
graph [5]. By providing an interconnection network of 
nearly universal permuting power these systems host a 
number of important algorithms with the property 
that if the cost of signal propagation along long lines is 
ignored the asymptotic complexity of the program is 
the same as that of the algorithm with all communica- 
tion costs eliminated. The price for such a universal 
network is paid in two ways: the surface area of silicon 
grows rapidly, O(n*/log*n), for ann processor device, 
and the lack of local uniformity makes it difficult to 
partition the system into a small number of easily 
packaged units. 


A second approach to general purpose parallel 
computing is being followed in the design of a 
Configurable, Highly Parallel (CHiP) computer [9]. The 
motivating goals are to exploit locality (i.e. permitting 
easy implementation of pipelined systolic processes as 
well as simple decomposability into reproducible com- 
ponents) and, at the same time, provide algorithmic 
flexiblility. 

The CHiP architecture has the following set of 
components. : 


1. A regular array of simple microprocessing ele- 
ments (PEs) each with a small but reasonable 
amount of local memory. 

e. <A front end controller. 


3. <A switch lattice. 


The switch lattice is a regular structure composed of 
progammable switches connected by data paths (see 
Figure 1 for an example), The PEs are connected to 
the lattice at regular intervals. The difference 
between programming the CHiP computer and a 


machine based on a universal connection network lies > 


in the programming of the switches. With the network 


' Research supporeted by NSF grant MCS-8109512. 

2 Supported in part by the Office of Navel Research 
contracts N00014-B0-K-0816 and N00014-81-K-0360 
Spec. Res, Op. Task SRO-100. 
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the programmer seeks to decompose the data flow of 
his algorithm into a sequence of stages realizable by 
the network (ie. a sequence of convolutions or 


shuffles). With the CHiP array the idea is to decom- 


pose the data flow graph into a sequence of simple sub- 
graphs that can be embedded into the switch lattice in 
a manner that optimizes some parameter such as 
locality. | 


HIS 


IM 


Figure 1. CHiP lattice. PEs shown as 
Squares, switches as circles. 


Bach switch is equipped with a small amount of 
local memory capable of storing several local 
configurations. A particular configuration setting 
enables a switch to connect two or more of its incident 
data paths directly and statically’. The controller can 
realize a new global interconnection by a system of sig- 
nal broadcasts directing each switch to select the next 
setting stored in its local memory. 


2. A Programming Example: Linear Recurrences. 
Linear recurrences arise so frequently as sub- 
processes of other algorithms that they are an 
important benchmark for exhibiting the speed and 
versatility of a multiprocessor. Following Sameh and 
Chen [1], and Kuck [6] we observe [12] that any linear 
recurrence of order q can be put in the form 


X= CC, + BX) 1 = 2,....M 
given B;, C,,i = 1,.... and X, where each B; is aq by 
q matrix and each C; and X; are column vectors of size 
q. Gajski [3] has observed that if we define a semi- 
group under the composition rule 
(C,, By) *(Cj, By) = (G+ Bi C;, BiB;) 
then the recurrence relation problem is equivalent to 


3 That is, the lattice uses circuit switching rather than 
packet switching. | 


first computing the products 
Dy; = T1(Cy, Bg) t= Run bk <i 
84 
and then evaluating 
xX = C, + BX; i= Qyeoe Md 
where (C, B,) = D,,. With n a power of 2 the standard 


computational flow graph of the required Di panes the . 


form shown if Figure 2, 


NS Boer ie 


Figure 2. Recurrence Computation Graph. 


Because standard systolic matrix multiplication 
algorithms are easy to implement on a mesh structure 
embedded in the CHiP, the programming for qg'>1lisa 
straight forward generalization of the first order case 
described below. While it is clear that the flow graph 
above can be embedded directly into a CHiP lattice of 
dimension n by log(n), this approach has two draw- 
backs: (1) for the evaluation of a single recurrence 
relation the PE utilization is low and (2) the last stage 
requires a data broadcast along a channel of length 
that grows linearly inn. 


A more careful programming of the switch lattice | 


can improve this situation. Collapse the flow graph by 
rows and assign each row of the computation to a sin- 
gle PE. In this approach the program consists of 
log(n) stages where at stage i processor P;, with 
je = 2-1 +k2*, k =1,....n/2' will broadcast Dy to 
processors P; 4, t = 1,..,2*. The main problem is to 
assign the logical. processors to the physical PEs in a 
n'/2 by n'/* CHiP lattice. While many solutions are 
possible one method that makes good use of locality 
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Figure 3. Switch Lattice Settings For n = 16. 
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and has a straight forward control structure is shown 
in Figure 3. The processors are numbered in the 
+c Wieland indexing of Thompson and Kung 
11]. 

The broadcasts in each stage are completed with 
“hyper-H” fan-out trees in which no edge is of length 
greater thann!”*/2. If one assumes that the propoga- 
tion time of an nMOS signal is linear in the distance 
traveled then the time to complete the recurrence 
computations will be of the form clog(n) + dn’? for 
constants c and d withd <c. Inthe special case that 
the recurrence relation represents binary addition the 
lower bounds of Chazelle and Monier [2] show that the 
scheme above is optimal in area and time. Further- 
more, if initially the data resides outside the CHiP 
array then the 1/0 complexity is bounded by the n!/? 
perimeter size. Consequently, one has an asymptotic 
match between computation and I/O complexity, 


3. References 


[1] S. C. Chen, D. J. Kuck, and A. H. Sameh, "Prac- 
tical Parallel Band Triangular System Solvers”, 
ACM Trans, on Math Software, Sept 1979, Pp. 
270-277. 


[2] B. Chazelle and L. Monier, "A Model of Compu- 


tation for VLSI with Related Complexity. 
Results", STOC, ACM, Milwaukee, 1981, pp.31B- 
325, — 


[3] D. D. Gajski, "Solving Banded Triangular Sys- 
_ tems on Pipelined Machines", Proceedings of 
the Int’l Conf. on Parallel Computation, IEEE, 

1979, | 


[4] L. J. Guibas, H. T. Kung, C. D. Thompson, 


"Direct VLSI Implementation of Combinatorial 
Algorithms", Proc. Conference on VLSI Archi- 
tecture, Design, and Fabrication, Calif. Inst. of 
Techn., Jan 1979. 


[5] D. Kleitman, F. T. Leighton, M. Lepley, G. L. 
Miller, "New Layouts for the Shuffle-Exchange 
Graph", STOC, ACM, Milwaukee 1981, pp.276- 
292. 


[6] D. J. Kuck, The Structure of Commiters and 
Computations, Vol. 1, John Wiley & Sons, Inc., 
New York, 1978. 


[7] . H. T. Kung, C. E. Leiserson, "Algorithms for 
VLSI Processor Arrays", C. Mead and L. Con- 
way, J/ntroduction to VLSI Sytems, Addison- 

| Wesley, Reading, Ma., (1980) pp. 271-292. 

[8] F. P. Preparata, J. Vuillemin, "The Cube- 
Connected-Cycles: A Versatile Network For 
Parallel Computation", Proceedings of the 
20th. Annual Conference on the Foundations of 
Computer Science, pp.140-147, 1979. 


[9] L. Snyder, Introduction to the Configurable, 
Highly Parallel Computer, Dept. of Computer 
Sciences Purdue University, Technical Report 

_  CSD-TR-351, May 1961. 

[10] L. Snyder, D. Gannon, "Linear Recurrence Sys- 
tems for VLSI: The Configurable, Highly Paral- 
lel Approach."”, in preparation, 1981. 

[11] C. D. Thompson and H. T. Kung, "Sorting on a 
Mesh-Connected Parallel Computer", CACM, 
v.20, no 4, pp.263-270, 1977. 


EMBEDDING A TREE IN THE NEAREST NEIGHBOR ARRAY 


Amar Mukhopadhyay and Ratan K. Guha 
Department of Computer Science 


University of Central Florida 
Orlando, Florida 32816 


Introduction 


The binary tree is an important interconnec- 
tion structure for hierarchically organized par- 
allel machines. With the advent of VLSI, consid- 
erable interest has been generated in dirct hard- 
ware implementation of a tree. However, the 
parameters for VLSI are quite different from those 
used in traditional design. Since a large amount 
of ciruitry must be packed in a chip, the total 
area of the chip is an important design parameter. 
The cost of communication rather than computation 
is also a major factor in VLSI design. This cost 
is reflected not only in the area occupied by the 
interconnection paths within the chip but also by 
the requirement of providing additional communica- 
tion channels with the external world via the 
input/output pins, which cannot be arbitrarily 
large for VLSI chips. For example, in the so- 
called "H" layout (Mead and Rem, 1979) of a tree, 
a large fraction of the leaf nodes are within the 
chip and communication to the external world needs 
additional area. This also increases the signal 
propagation time between any two nodes to 0(vn) 
rather than being the natural O(logn) value for a 
tree, since the nodes at higher levels of the tree 
use longer and longer wires for interconnection. 
Another important design parameter for VLSI design 
is known as the "regularization factor" (Lattin, 
1979) which measures the regularity of the embedd- 
ed structures in the chip. Intuitively, it means 
that the number of "templates" of interconnection 
and "cells" for computation should be kept as 
small as possible in defining the layout of the 
chip. 


In this paper we consider the problem of lay- 
ing out a complete binary tree with n leaves on a 
chip and show that for practical size trees in 
which the nodes correspond to processors and the 
connection between two nodes correspond to commun- 
ication paths between the processors, the proposed 
solutions represent the best known area efficient 
layout with maximum availability of the leaf and 
root nodes at the boundary and low communication 
cost. We show that this could be done on a near- 
est neighbor array with three basic connection 
templates. The problem of laying out a tree on a 
plane of minimal area has recently been researched 
extensively [Leiserson (1979), Valiant (1981), 
Browning (1980), Locanthi (1980), Brent and Kung 
(1979) and Krishnan (1981) ]. 


The Model of the Silicon Surface 


The proposed model of the two dimensional 
surface has the following attributes: first, each 
processor takes a unit area on the surface; second 
the processors are connected in the nearest neighbor 
array uSing the interconnection templates, as 
shown in Figure l(a), (b) and (c) and their rota- 
tions by 90°, 180° and 270°, where Po is the parent 
processor and P, and P, areits sons. Intemplates 
(a) and (b), the parent processor is directly 
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connected to both its sons and we will say that 
the communication between them will be provided 
by implicit wiring. The necessary area devoted 
for this purpose is some fraction of the unit 
area located at the interface. In template (c), 
the Pg is directly connected to Py which in turn 
is directly connected to P,. Pg and Pj are con- 
nected to each other via Po. 


Each processor has two or three sets of in- 
ternal and/or external ports. Any information 
sent to Py by Po must be retransmitted to P, if 
the information was directed to both P, and P92 or 
P, only. Any information sent by P; to Py is al- 
ways transmitted to Po by adding a bit to it so 
that Pg knows that P; is transmitting to Pg. 

When Pp is transmitting information to its sons, 
it should indicate which son is supposed to re- 
ceive it. When Po receives information from one 
of its sons, it should check who sent the infor- 
mation. Finally, if a wire is used to communi- 
cate between processors, the width of the wire 
will be 1/£, when f21. Such wiring will be call- 
ed explicit wiring, as shown in Figure 2 between 
processors X and Y. 


The model differs fromthe previously studied 
models in distinguishing between the explicit and 
implicit wiring and also in not assuming that a 
wire has unit width. The communication between 
Pg and Py; in template (c) requires more time com- 
pared to time taken for direct communication, 
which has to be taken into account in synchroniz- 
ing the computations performed by the tree. The 
implementation of the necessary protocols seems 
straightforward in terms of both additional hard- 
ware and software. 


An alternate communication geometry will 
consist of 8-neighbor array interconnection in 
which each processor can directly communicate with 
four nearest neighbors and four nearest diagonal 
neighbors, as shown in Figure 7. The regions 
shaded at the boundaries of the cells denote the 
areas devoted to build communication channels 
between cells. 


The "Ideal" Tree Layout 


In this section, we prove a result of aca- 
demic interest which apparently contradicts the 
result obtained by Brent and Kung (1979). Assume 
that each processor needs the same area indepen- 
dent of its shape and that the communication be- 
tween the parent node and its sons takes place by 
implicit wiring. Then, a binary tree can be laid 
out in annular zones of an expanding circle or 
square, as shown in Figures 3(a) and 3(b), respec- 
tively. Let r; denote the radius of the ith cir- 
cle or the side of the ith square. Then, obvi- 
ously, we have the relation 


a Do ee 
My Tye 7 285-178 4-2) 
Thus, ; 
r,=/2i-1, fee Os seins 


Thus, a k-level tree with n= 2K] nodes can be 

laid out in an area of exactly n units with all 
the leaf nodes being accessible at the boundary. 
With increasing n, the boundary processors have 


thinner and more elongated shapes compared to 
those near the center. This inhomogeneity in 
shape makes the scheme practically unrealizable 
and constitutes the key point of difference be- 
tween our result and that obtained by Brent and 
Kung where the "aspect-ratio" of each processor is 
assumed to be l. 


The Proposed Layout Scheme 


Even if we can live with the arbitrary aspect 
ratio of the "ideal layout,'' the realities of pin 
limitations will restrict layouts of trees beyond 
depth k=9 or n=511 in the most optimistic case. 
Current packaging techniques allow a maximum of 
about 120 pins on a chip. A factor of 4 increase 
puts an upper bound of about 500 pins on a chip. 
Thus, if we assume that all communication to a 
processor chip should take place via a single pin, 
serial port in a "message-switching" mode, the 
maximum size of the tree will be limited to n=5ll. 
However, if accessibility to leaf nodes is not a 
requirement, bigger size trees could be laid out 
by restricting the communication through the root 
node with the penalty of slow communication. 


The layout algorithm will consist of special 
cases for n=3, 7, 16, 31 and 63 and a general al- 
gorithm for n2127. The layouts for n=3, 7, 16 and 
31 are shown in Figure 4. Note the trees can be 
laid out using only implicit wiring and all the 
leaf nodes and the root node are accessible at the 
boundary. For n=63, the tree can be laid out in 
almost an implicit wired form, as shown in Figure 
5, which needs 16f explicit wiring and an addi- 
tional 24f units of area over the densely packed 
layout, with all leaf nodes and the root node made 
accessible at the boundary. For n=127, 255 and 
511 above, the floor plans are obtained by follow- 
ing the general layout algorithm as described be- 
low and illustrated in Figure 6 for the tree WXYZ 
with n, = 2k_] from the pair of trees ABCD and EFGH, 
each with Ny} = 2 -l_] which will form the two 
subtrees of the root r of the tree. Assume that 
each of the sides AB, CD, EF and GH contains Chiy 
external connections; each of AD and EH contains 
Y,~-1 external connections; the external conections 
on the sides BC and FG include connection for the 
root and contain ry,.,)+l external connections and 
are denoted by Cees 


The sides BC and FG of the subtrees are 


l. 
aligned so that the connection to their roots 
can be connected to the root node r of the 
tree. 

2. The connection of r to the roots of the sub- 


trees partitions the external connections at 
BC and FG into two classes, each having an 
equal number ry,-1/2 connections: those that 
can be routed upwards for external connection 
and those that can be directed downward for 
external connection. The connection to the 
root node r can be brought out either to 
downward or upward direction. 


In the resulting tree WXYZ, the side YZ con- 
tains the external connections for the root r. 
The external connections for XY and WZ are given by 


ae 


262 


The external connections for WX and YZ are re- 
spectively given by 


Ty = 20, ty 
Te = 2¢, tt, 471 


If the length of a side of the subtrees (viz. 
length of BC) is h, the total area.A of the lay- 
out can be expressed as 


AC2-1) = 2a(s8"!-1)+he(r,_j+1) 


and total average length of explicit wiring W can 
be expressed as 


w(2S-1) = 2w(25"*-1)4r, (er, tht2£)+h/2 | 


k-1/2 
A comparison of the proposed layout with those 
obtained by "H" method or Krishnan's method in 
terms of total area, explicit wiring cost, signal 
propagation time and accessibility will be in- 
cluded in a detailed paper under preparation. 
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Summary 


As technology advances, the trend has developed not only for 
_ more functionality per system but also for more hardware. 
Demands for higher levels of performance and throughput are 
also present. These demands are pushing hard on centralized von 
Neumann computer.architectures which are already optimized to 
near their ultimate physical limits. Solutions in such architectures 
take the form of increasingly exotic high-technology components 
and structures. This makes the incorporation of redundant 
elements necessary for enhanced reliability ever more costly. 


As the quantity of circuitry per system increases, the 
probability of failure also rises. Use of newly-developed materials 
and devices (e.g. GaAs, Josephson effects) as well as the 
continuing micro-miniturization of existing technologies is 
making the typical gate less reliable. In order to maintain 
acceptable levels of reliability, major computer suppliers are 
designing redundancy into their equipment. For example, the 
IBM 370/168 uses parity-prediction in the adder, as well as parity 
protection on all data paths. Error detection and correction 
circuitry makes up over 15% of the 370/168. Because of the 
considerable cost of this additional circuitry, it is. clear that overall 
system reliability has become a very important issue. 


Thus, two opposing forces are acting on computer designers. 
The demands for ever higher performance and functionality are 
causing more exotic and expensive circuitry. At the same time, 
the use of more circuits per system is causing overall system 
reliability to decrease. Centralized architectures are being forced 


into becoming either less reliable or more costly. It is now time to. 


consider other solutions. 


There are two potential avenues of approach to the problem, 
use of non-centralized architectures and adaptation of reliability 
enhancement techniques. This paper explores the issues of a 
combined approach using a distributed architecture and a 
modified fault-masking method. The objective is to find a 
realistic, effective solution to the conflicting design demands 
while retaining acceptable levels of reliability and cost. : 


As an alternative to the centralization-induced bottlenecks of 
~ von Neumann machines and a possible avoidance of difficulties 
with SIMD and MIMD forms, an’architecture known as data flow 
was introduced [7, 3, 5]. The data flow model of computation 
utilizes a directed graph to: depict the transformations and 
dependencies of the user program. It features an activation-by- 
availability cxecution model which can exploit virtually all 
concurrency in the user problem. The architecture lends itself to 
a totally distributed implementation with many copies of the basic 
processing elements sharing the load. As such, each individual 
element can be much simpler. Performance is no longer the single 
dominating design issue since additional throughput can be 
achieved at the system level by adding more parallel units. 


Such an architecture provides exactly the sort of relief that is 
needed for designing reliable, yet cost-effective digital systems. 
By solving the performance problem at the architecture level, we 
make room for addressing the issues of reliability at the package 
level. Based on ideas from data flow, we offer a design style that 
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can be used to construct reliable and practical subsystems, © 


There are several effective techniques for enhancing the 
reliability of digital systems . A commonly-used fault-masking 
technique is triple modular redundancy [8]. Because of costly 
triplication, TMR (as depicted in Figure 1) is normally too 
expensive to consider for most applications. .By using a 
distributed architecture, we can consider adaptations of this fault- 
masking technique which can be practical and cost-effective even 
for modest reliability requirements. 


Voter Outputs 


Figure 1: Triple Modular Redundancy (TMrR) of a 
Byte-Sliced Module 


In this work we use TMR, but make a basic trade-off of time 
for space. We do not require triplication of all circuitry. Rather, 
we advocate slicing the simplex form of the system to create n 
identical byte-wide slices. Triple redundancy is then achieved by 
multiplexing inputs to the slices in time (as shown in Figure 2). 
The price paid for our trade-off is up to a factor of 3 in time. We 
remark, however, that the presence of asynchronous portions of a 
design may allow a significant fraction of this time penalty to be 
overlapped. Thereby, we get the full fault-masking capability of 


_ TMR without the high cost of triplication. The majority of the 


system redundancy is in the time domain. 


Inputs Sliced Module Voter Outputs 


Figure 2: TTR Implementation 


Because we have eliminated physical triplication of the 
module, there is considerably less equipment in the system (as 
compared with TMR). Less equipment means less circuitry to fail. 
Because they are simpler, the sliced circuits should individually 
be more reliable. We show in[2] that the reliability and 


availability of our fault-masking scheme arc potentially superior 
to those of TMR. 


In [2] we define several basic elements and rules of 
composition for constructing a fault-tolerant digital subsystem. 
Throughout this work, as in TMR systems, a single fault model is 
assumed. The basic building blocks are slice elements (with 
arbitrary unidirectional carries), transformation elements, and 
memories. The subject of fault tolerant memory design is not 
addressed. Use of a Hamming SECDED code or equivalent is 
~ assumed. 


For sliced (S-)elements we use triple time redundancy, TTR, a 
generalization of a method used in [6]. In three successive TTR 
steps, a pump circuit presents sets of inputs to each slice. Each set 
is rotated by one byte width. Thus, a slice produces its own 
output and that of the two immediately previous slices. Use of 
pipelining allows overlap of some TTR overhead. A sequential 
majority voter element is used at the end of pipelines to restore 
correctness in the presence of arbitrary single slice failures. 
Specialized elements for subsetting, splitting, joining, and 
(de)multiplexing S-element streams are defined in [2]. Some of 
these elements perform necessary reanranecene of outputs 
during TTR phases @, and M,. 


For so-called je aetoneibn (T-)elements (i.e. unsliceable, 
full-width functions), the problem of preserving correctness in the 
presence of single faults is somewhat more difficult. It has been 
shown [1] that by utilizing the regular structure of PLAs 
concurrent error detection can be accomplished. This immediate 
error detection capability, (along with simple duplication) can be 
used to render a T-element single fault-tolerant. Such elements 
form the basis for reliable finite state machines. 


A fault-tolerant clock scheme [4] is used in our system. In this 
scheme three independent oscillators are voted by three majority 
voters. The voter outputs are fed back to the oscillators for 
synchronization and also are distributed to interleaved slices so 
that a failure in a voter will be maskable via our normal slice 
voting methods. 


Since some specialized S-elements must recognize the TTR 
phase currently active, we provide paths in all elements for clock 
tokens. The presence of a logical one at a pipeline stage and 
particular clock token line specifies the phase currently active in 
that element. This technique has built-in redundancy which can 
be monitored if desired, i.e. at most one of the three token lines 
may be asserted in a given S-element at any instant. 


The series of pipelined S-elements between a pump element 
and a voter is called, inclusively, a segment. Each segment is 
strictly synchronous with flow controlled by the voter module. 
Each voter has an ENABLE line by which its successor controls the 
outflow of data. Whenever the successor is not ready the ENABLE 
line is held false. This causes the voter to continue sequencing 
until @. and then stop the pipcline. 
pipeline is sent back to the pipeline’s predecessor element as its 
ENABLE line. Thus, data in a segment can “pack up” when the 
voter outflow becomes blocked. The travelling clock tokens keep 
track of logical phase in our system. 


The pipeline control in splitters, join elements, and 
(de)multiplexers is straightforward, e.g. the predecessor ENABLE 
line in a splitter (a one-to-many structure) is the result of an AND 
of its incoming predecessor ENABLE lines. Details for all element 
types and a complete design example are presented in [2]. 


The example subsystem is a simplified processing element for 
the cell block data flow architecture [5]. A four word instruction 
packet in TTR form is assumed for input. Two result packets are 
produced. The design is a multi-pipeline aggregate comfortably 
supporting instruction packet rates in excess of 1 mip. The time 
limitation in our example is the speed achievable in the packet- 
transferring S-elements. A crude estimate of the space overhead 
as comparcd with a non-fault-tolerant design is 200%. The timing 
and packaging issues are studied and presented in [2]. Due to lack 
of space, we refer the reader to the report. 


The signal to stop the 
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A computation of the reliability of our fault-masking 
technique has been performed. Figure 3 gives comparative plots 
of reliability versus normalized time for simplex, TMR, and TTR 
systems. ‘The computation is based upon,a constant module 
failure rate (Poisson) model where p = @“*. For n or ws 
independent slices, this corresponds to a siice failure rate of 
Since the slices are roughly Awe as complex as an mlved 
simplex system, we use p = & 
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Figure 3: Comparative Reliability of TTR, TMR and Simplex 


The technique compares favorably with TMR systems, both in 
terms of cost and reliability enhancement. As in any fault tolerant 
system, there are critical regions upon which the reliability of the 
entire subsystem depend. In our technique, these regions are the 
storage nodes of voters and the parallel portions of the 
interconnect. These represent a small portion of the overall chip 
area and can be rendered more reliable through the use of 
conservative design layout rules. 
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Abstract -- Synchronous Nets, or S Nets, are 
developed as a modeling tool particularized for 
describing processes on Single Instruction 
Stream - Multiple Data Stream (SIMD) computers. 

S Nets are a modification of Petri Nets, using 
transitions and places to model events and con- 
ditions. However, S Nets introduce vector-mask 
places to model the conditions of the array 
resources of SIMD machines. These places are 
distinguished from scalar places which model the 
scalar resources. S Nets also introduce a new 
kind of transition. 
Petri Net transition, but the mask firing transi- 
tion is particularized to the SIMD environment, 
modeling the inherent capability of a computation 
executing on a SIMD machine to alter the partici- 
pation of the vector aggregates in successor 
events. 


Introduction 


This paper is concerned with the problem of 
mapping algorithms onto certain classes of paral- 
lel processors to exploit parallelism in the 
algorithm to the maximum extent supportable by 
the machine on which it is to be implemented. 

The approach taken is one of providing a tool -- 
a graph-based modeling system called Synchronous 
Nets or S Nets -- to describe such an implementa- 
tion. 


SIMD Architectures 


_ The processors of interest are of the single 
instruction stream-multiple data stream (SIMD) 
architectures as described by Flynn [1] who 
distinguishes four classes: . 


Single instruction stream - single data 
stream (SISD) 


Single instruction stream - multiple 
data stream (SIMD) 


Multiple instruction stream - single 
data stream (MISD) 


Multiple instruction stream - multiple 
data stream (MIMD) 


Figure 1 illustrates an SIMD architecture, 
which typically consists of a control unit with 
its own memory, and (possibly) some limited pro- 
cessing capability; an array or vector unit con- 
sisting of N Processing Elements (PEs) and at 
least N memories (PEMs); and an interconnection 
network for interprocessor communication. Asso- 
ciated with each PE is some indicator (mask) for 
Signaling participation or non-participation in 
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instructions. In implementation, usually a con- 
ventional sequential machine is attached to the 
control unit, i.e. a mini-host. 


A Multiple-SIMD (MSIMD) architecture is con- 
figured as two or more independent SIMD machines, 
each with its own control unit, array unit, etc., 
and with one interconnection network. These SIMD 
components have the ability to perform synchron- 
ously, and using the same instruction stream or 
different instruction streams. Such an architec- 
ture is illustrated by Figure 2. | 


SIMD machines are considered "special pur- 
pose."" They perform spectacularly on problems to 
which they are well suited [2, 3,°4, 5]. To 
derive high performance, the application should 
have a high degree of parallelism, with the 
algorithm consistent with the topology of the 
machine. Unfortunately there are no simple means 
to gauge this desired isomorphism. Modeling is 
one approach that can be employed. S Nets were 
specifically developed to accomplish and facili- 
tate this, and are a modification of Petri Nets 
[6, 7, 8, 9] supplying constructs particularized 
to SIMD (and MSIMD) architectures. 


Definition of Synchronous Nets 


System Overview 


A Synchronous Net, or S Net, is a directed 
graph with a marking and a set of descriptors 
[10]. The vertices of the graph are vector-mask 
places, scalar places, and transitions. Scalar 
places and vector-mask places are connected with 
arcs to transitions and vice versa. A marking 
associates a non-negative integer with each 
scalar place, and associates a tuple of non- 
negative integers with each vector-mask place. 
The non-negative number is called the number of 
tokens. Descriptors are associated with each 
transition and characterize the behavior of the 
transition. | 


As with Petri Nets S Nets use transitions to 
model events and places to model conditions with 
arcs representing the paths allowed for passage 
of control. Analogous to Petri Nets S Nets 
exhibit dynamic behavior resulting from the firing 
of transitions. The firing of a transition models 
the occurrence of an event; tokens in a place can 
model the holding of a condition. 


The key differences between Petri Nets and 
S Nets are the S$ Net tnnovations of vector-mask © 
places and mask firing transitions. Vector-mask 
places model aggregates of logically associated 
and homogeneous conditions whose initial and 


ceasing events are synchronized, i.e., the condi~ 
tions of a set of array processors, These aggre- 
gates are further characterized by the fact that 
the marking of some members of the aggregate may 
be relevant to the firing of a successor transi- 
tion while others may not. This characteristic 
can model the participation or non-participation 
of some elements of the array processor in sub- 
sequent events. 


Unique to S Nets is the concept of two kinds 
of transition firings -- one of which — the mask 
firing — provides for alternatives in the 
markings of the aggregates. This enables 
modeling of changes in the participation or non- 
participation of the elements of an array pro- 
cessor as it proceeds from event to event. 

These alternatives are formalized by descriptors 
associated with each transition. 


S Net Graphs 


S Nets will be defined in terms of sets. 
The element of a set will be designated within 
{ }. The CARDINALITY of any set shall be 
designated and refers to the number of 
elements in the set, i.e., S| represents the 
number of elements in S. For example, 


if $= {sy, Sp9+++85, then ls | = j. 


Also important in the S Net definition is the 
notion of tuples denoted by < > and consisting 
of ordered components. The cardinality of a 
tuple is also designated | |, but it is more 
appropriately called its DIMENSIONALITY. 


An S Net Graph will be a quadruple (T,S,U, 
A), with an initial marking Kp and a set of 
transition descriptors D, where: 


T = A finite set of transitions 
{t,, tyres Ey de 
S = A finite set of scalar places 
{sy, aie 
U = A finite set of vector-mask places 
{<Vy, Mp>, <Vo, Mo? s+ -<Vigyy Mig)? 
A = A finite set of directed arcs 


{a,; Agre+0A » such that 


jal” 
A & (PxT) \ (TxP), where P = U (J S and P is 
called the set of places. 


Thus the elements of A are of the form <p., ty? 
or <t:, Py» so that an arc either connects a 
place to a transition or a transition to a 
place. 


The set U is defined as a subset of VxM, 
where: 


V = A finite set of elements called vector places 
{V1, Vossen hi git each element of V, desig- 


nated Vi> is a tuple containing some number 
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of ordered components, i.e., 
ip”? p > 1, and Iv, | does 


necessarily equal Iv, | when i + j, but 


V; = <Va4? VaggoeeeV 


vj = |v 


A finite set of elements called masks 
{M, » Moses its each element of M, 


designated M., is a tuple containing some 
number of ordered components, i.e., 
M 


= <Myz> Myyoee-MyQ?s > 1, and Im, | does 


i iq 


not necessarily equal IM, | when i + ar 
but |M| = lu]. 


If Pis UW S, and A is as defined, the 
triple (P,T,A) is a bipartite directed graph 
since all nodes can be partitioned into two sets, 


transitions and places, such that each arc 


directed FROM an element of one set is directed 
TO an element of the other set, and vice versa. 
Therefore arcs from(to) a vector-mask place or a 
scalar place are always directed to(from) a 
transition. 


In the S Net Graph, transitions are repre- 


sented by , the scalar places are represented 


C) , and the vector-mask places by 


symbol Vs is ° 
3 or . The dimensionality of V; 


is portrayed as A : 


by 
Within that last symbol, the vector 


and the mask symbol M, is 


The dimensionality of M, is not noted on 
the graph, but is specified in the formal 
designation of M; components, i.e., 

pare aoe ie Arcs are 


denoted as ——s== 


Given the S Net shown in Figure 3, we shall 
delineate the graph of the S Net as follows: 


T= {ty, ty, tg, tys ty, te} 
Se {s1, So» s3} 
V= {V,, Vy, V3, v,} 


Vi = SViq> Yy9° %43 

Vo = SVoq? Yoo? Y237 

oe 

VA = Var? Yao? Yy3 

M = {M,, My, Mg, M,} 

My = <myz> ™jp> M57 

M, = <M54> Moo» My 37 

M3 = <M31>5 M39, 33> 

M, = <M@gy> Myo> Mg” 

— (vy, My>, <V2, My>, <Vz, My>, <V,, M,>} 
& ts {<s1, ty>s <ty> <Vj> My >>, <<Vy>, My>, to>s, 


<to, <Vo, Mp>>, <<Vp, My>, t3>, <t3, S5>, 
<tyz, <V3, M3>>, <<V3, Mg>, ty>, <ty, <q, 


My>>s <<Vy, My>s t3>s <8, ty>, <tg, s,>, 


<So> t67> <t¢, s3>} 


S Net Structure 


Analogously to a Petri Net, the structure 
of an S Net is defined so as to make clear the 
relationship of places and transitions. The 
INPUT PLACES of a transition f(t) are all scalar 
places and vector-mask places directed immedi- 
ately TO the transition. The OUTPUT PLACES of a 
transition 0(t) are all scalar places and vector- 
mask places directed immediately FROM the 
transition. 


Markings 


The infinite set of non-negative integers 
{0, 1,...} is designated N; the set of Boolean 
numbers {0, 1} is designated B; the set {0} is 
designated Z. Also the r-fold Cartesian pro- 


ducts are defined: 
NX = NxNx...N; each member is of the form 
<N)> Noa-+-N > 
Br = BxBx...B3; each member is of the form 
<B,, Bo»---BL> 
Zr = ZxZx...Z3; each member is of the form 


<0, 0,...0> 


where N, and B;, are elements of N and B, 


respectively. 


A. MARKING is a function K where 
lv, | 
S>*N; V; 7 N for all Vv; € V; 


[My | 


K: 


M; | for all M; € M. 


The marking associates a non-negative integer 
with each scalar place — K(s) for eachse S — 
and two vectors of non-negative integers with 
each vector-mask place, one of those associated 
vectors being a Boolean vector — K(V; ) for each 
V, € V and K(My ) for each M, eM. 


A marking for an S Net must pacity all three 
compenceess 


An INITIAL MARKING Ko is defined as the first 
marking of the S Net. | 


A MASK MARKING for a mask M; is a function K 
[My | | 
such that M; > B ‘ 


The set of possible mask markings for any M; is 
W(M;) and denotes the co-domain of a mask marking, 


Ma | or if 


ma | 


consisting of designated tuples of B 


appropriate, the entire product set B 


Notation for Markings. The convention 
adopted to show markings will be that of ( ) and 


< >. The former is used to distinguish the 
marking of a single element or component, and the 
latter is used when more than one element or com- 
ponent is involved, thereby denoting an ordering 
with respect to markings of elements or 
components. 

As an example, markings for places 815 S55 83 
are designated K(s}) = (0); K(s5) = (0); 


K(s3) = (0). 

1 a {v,; Vo}, and if Vy = <V44> V9? V137 
then K: V, > z3 
K(V,) = <0, 0, 0>; alternately K(v,4) = 


is equivalent to: 


(0); 


K(v, 5) = (0); K(v, 3) = (0). Similarly, for 


>, a marking K(M ) =.<1, 0, 0> 


Lie yo a 
designates that K(m,4) = (1), K(m, 5) = (0), 
K(m,5) = (0). 


M, = <m 


Graphic Portrayal of Markings. Markings are 
illustrated with the presence of tokens. Dots in 
any place represent tokens. Tokens in masks may, 
alternatively, be represented by Boolean symbols 
for legibility. 


The symbol for M, shows a token in 


i 


M54 and Ms9- This is synonomous to the symbol 


© 
© 
10 


Using the S Net example of Figure 3, Figure 4 
illustrates a marking where masks are marked with 
tokens but vector places are not marked with 
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tokens. This example assumes the initial marking 


is: 

Kg(s1) = (1)s Kp(s5) = (0) Ky(s4) = (0) 

Ko (Vy) = Ky(V.)= Ky (V3) = Ky(V,) = <0, 0, © 
Ky(M,) = Ko(M)) = <1, 0, 0» 

KyMq) = <1, 1, b> 

Kg(M,) = <0, 0, b 


An assignment of tokens to a vector place V3 may 


leave some of the component places marked with 
tokens and others empty, i.e., all elements v 


iJ 
of V; may not have tokens. Since the lv, | may be 


large, graphic designation of which components 
are marked must necessarily be limited. For 


e 
components; then ; 


2) 


€ Vi are marked. 


O 


and : ‘ 


example, a portrays a Vy with three 


indicates that two 


V. 


4 Synonomous are the symbols 


However only 


i; are marked but does not 


distinguish the individual elements, nor does it 


indicate how many tokens are in each marked Vag: 
However S Nets use vector-mask places to model 
conditions resulting from and leading to events, 
and V; is always expressed graphically in con- 


junction with M;. It is My which will be used 


graphically to enhance comprehension of which Vij 
are marked — at least in markings resulting from 
an execution of the S Net. 


conveys that two Vij 


Rules for Execution 


The graph and structure of S Nets have been 
addressed in previous Sections. Now discussed is 
the dynamic behavior of S Nets. | 


Enabled Transitions. A scalar place is 
HOLDING if it has at least one token in it. 
vector-mask place <V;5 M,> is HOLDING if: 


A 


at least one K(m, 5) = (1), j =1, Dowae Ve ts 


and v 


13 
those j for which mz € M; has a non-zero marking. 


€ V5 has a non-zero marking for all 


A holding for a vector-mask place is in con- 
trast to a marking of that place. Whereas a 
marking associates some set of integers with 
vector-mask places, a holding for a vector-mask 
place REQUIRES that the components of V; be 


marked with tokens everywhere that their 
associated M; components are marked with tokens. 
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A transition t is ENABLED also called FIRABLE 
under the following conditions: t is ENABLED if 
all scalar places in I(t) are holding and all 
vector-mask places in I(t) are holding. 


Firing Transitions. A FIRING is a function 
of a transition which has for its domain and range 
the marking of the input places and output 
piaces of the transition. There is a firing 
associated with every enabled transition t. When 
a transition t is enabled, its firing function is 
defined at a given marking Kn of the S Net, and 


the firing yields K,4,, a new marking. 


Transition Types. A TRANSITION TYPE speci- 
fies the firing capabilities of the transition — 
either simple or mask firing — designated SFT and 
MFT respectively. 


Transition Descriptors. A TRANSITION 
DESCRIPTOR p[t] specifies the transition type, 


either SFT or MFT, and for every vector-mask out- 
put place <V;, M;> of the transition, specifies 


W(M,;), the set of markings for M;. Descriptors 


for a transition t with vector-mask output places 
<V4 5 M,> > Was Mim, 2-6 <V M,> are specified: 


Yr? 
D[t] = [types K(M,) © WM), 
K(M;) © W(M;),...KQ,) © W(M,)]. 


Rules for a Simple Firing. A SIMPLE FIRING 


associated with an enabled transition t is such 
that: 


For every scalar input place s, then 
Kit 6S) = K(s) - 1 
For every scalar output place s, then 
Kat (3) = Ks) + 1 


For every vector-mask input place <V;, M;?, 


then: lv. 


for v4, © Vy, j= 1, 2,... ols 


K +l (Vi4) = K, 44) - 1 for those j for 


which mi5 € M. has a non-zero marking; and for 
maj e M;> 
Kay 5) = K (43) for all a 


For every vector-mask output place <V,, M;?> 


then: for Vij NGS. 1 ay ee rae 


1 


K+ v5) = K (v44) + 1 for those j for 


which Mi; EM; has a non-zero marking; and for 


mM. 


ij E Mi, 


K+ (m5) = K, (mj) forall j. 


As seen from the firing rules, SFTs do not alter 
their input or output masks. 


Rules for a Mask Firing. A MASK FIRING is 


associated with an enabled transition t that 
has at least one <V,5 M;> output place, and 


is such that: 


For every scalar input place s, then 
K 9 (6) = K, (s) - 1 
For every scalar output place s, then 
K+] (8) = K, (s) +1 


For every vector-mask input place <V5. My>, 


L 


for v;: 


then: ij 


SV ig a Ly De cce VG | 5 


Kat1 (v45) = Ky (v4) - 1 for those j for 
which mj; € M; has a non-zero marking; and for 
ay eae 

Ka+l (m; 5) = K (m,5) for all j. 


M.>, 


For every vector-mask output place <V,, + 


then for M;, 


Kl (M,) € WM), where WQM; ) is specified 
by the transition descriptor D[t], and K +1 My) 
is non-deterministically chosen. 


For every vector-mask output place <V3, Ms>, 


1 
then for vq; € Vy, j = 1, 25 soa Wale 
K tl (vi5) = K, ap + 1 for those j for 


which m;, € M; has a non-zero marking, i.e., 
where Kn+1 (m; 5) a (1). 


The assignment of a Boolean vector to M; by MFT 
is a mapping of M,; INTO W(M,;), where the domain 


is M; and the co-domain consists of the elements 


of w(M;), i.e., Mz 2NZO womy). 

By the firing rules, firings remove tokens 
from places and add tokens to other places, and 
in the case of the mask: firing mark the masks of 
the vector-mask output places. It should be 
noted that the number of tokens subtracted by a 
transition firing does not necessarily equal the 
number that it adds. 


Transitions and Their Descriptors. As seen 


from the firing definitions, SFTs on firing do 
not change the K(M; ) of their <V;5 M,> input and 
output places. The transition descriptor is 
noted simply as D[t] = [SFT; _]. 


For MFTs, the |w(M,)| > 1 for all output 
masks, and since these markings are determined 
by the transition firing and not the initial 
marking, the set of markings must be listed 
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in the transition descriptor, i.e., 


Dit] = [MFT; KM) e{< >< >...H 


Example of Mask Firing Transitions. Figure 
5 through Figure 8 show MFTs in an S Net and © 


illustrate their graphic portrayal and behavior. 
Given an initial marking and descriptors: 


Ko (sy) = (1): Kg(s,) = (0); Kg(s,) = (0) 

Kg (V,) = Kg(V,) = <0, 0, 0> | 

D[t,] = [MFT; K(M,) € {<1l, 0, 0>, <0, 0, 1>}] 
D[t)] = [MFT; K(M)) € {<l, 0, 0>, <0, 0, 1>}] 
D[t3] = D[t,] = D[ts] = (SFT; ] 


Figure 5 reflects the initial marking and 
shows that ty is enabled. Transition ty has one 


output mask and |wamt,) | = 2, both elements of 
which are shown on the graph. When transition ty 


fires, the results are shown in Figure 6. The 
marking assigned to Mj by t; was <1, 0, 0> which 


is designated in D[t,] and is shown on the graph 
as one member of the set W(M,). (At the time of 


firing the mask marking that is chosen by the 
transition is arbitrary.) After t, fires, vj 


receives one token since mz; is marked with a 
token, and a token is removed from S1° 


The Vu1 token enables t, since m4 also holds 
a token, so that the firing of t5 can commence. 
If tg fires changing the marking K(M5) to 
<1, 0, O>, and if the firing sequence ty, to, tz; 
t, is assumed, then Figure 7 illustrates the 


marking after t, has fired. In Figure 7 a token 
is again in 843 K(M,) is <1, 0, O> from the 
previous firing of t,; K(My) is <1, 0, 0> as 
marked from the previous firing of to; K(V,) is 
<0, 0, O> since the token in vj placed there 

as a result of the first t, firing was removed at 
the firing of to. K(V,) is <0, 0, O> since the 
token in v5 placed there after the firing of ty 


was removed at the firing of ty. 


Where Figure 7 shows t, enabled, Figure 8 
shows the results after the second firing of ty: 
Here K(M,) = <Q, 0, 1>, a marking alternative also 


described in D[t,]. (The selection of the marking 
is arbitrary, and could have been <l, 0, 0> 
again.) Given K(M,) = <0, 0, 1>, by firing defi- 


also 


nition V13 now receives a token. 3 


holds a token, t 


Since m 
1 


9 is enabled, and so on. 


To analyze an algorithm, a sequence of S Net 
transition firings can be examined. The sequence 
resolves conflict, or indeterminacies in computa- 
tional flow, in that a particular order of firings 
is assumed. (Conflict is typified in Figures 5 
through 8 by t, and ts which share an input place; 


when one token resides in the place, only one 
transition can fire and which transition fires is 
indeterminate [6, 7, 8, 9, 10]). Analogously, 

in modeling a specific computation, the indeter- 
minacy of mask selection by MFTs is not trouble- 
some by assuming an order of mask selections. 


S NET APPLICATION 


To apply S Nets it is necessary to relate 
the model to the actual algorithm. An INTERPRE- 
TATION of an S Net is an assignment of labels to 
the transitions and/or places of the Net to 
indicate for the transition the event that 
it models and for the place the condition that 
it models. 


Consider the example of summing the rows of 
a 4x4 matrix A, multiplying the sum by a 4xl 
vector B, and storing results in the first column 


of A. The FORTRAN description is: 
DO 200 T= 1, 4 
DO 100 J=1, 3 


100 ACI, 1) = ACI, 1) + ACI, J+1) 

200 ACI, 1) = ACI, 1) * BCI) 
Assume an 8 PE SIMD machine is available. A 
data storage scheme for the vectors is depicted 
in Figure 9, which indicates that PEo has the 
first row of A in its PEM, PE, has the second row, 
etc. With parallel hardware available, the four 
row sums can be formed simultaneously. This is 
shown in the S Net model of Figure 10 which is 
marked to reflect a holding of condition after 

a ty firing. The S Net uses vector-mask places 
to model conditions in array resources and 

scalar places to model conditions in scalar re- 
sources. The masks of the vector-mask places 
model control over the participation of the array 
resources. 


Transition t., models the event which adds 
the Jth column of matrix A to the first column 
of matrix A. Places s, and s5 model the condi- 
tions resulting from the test of J. Assuming a 
sequence of firings such that ty, and ts have con- 
flict resolved by the status of loop index J, 
when ty fires, all rows have been summed; then 
t- models a parallel multiplication of all four 
sums by the appropriate element of vector B. The 
utilization of the array resources is a by- 
product of this S Net execution, i.e., the mark- 
ing on the vector-mask output place resulting 
from the ty firing is depicted on Figure 10 as 
4/8. Both the parallelism achieved by the array 
resource (4) and the utilization with respect to 
the maximum parallelism supportable by the array 
hardware (8) becomes apparent. Also, the marking 
of the mask suggests some additional management 
activity required of the vector resource and is 
specific as to which PEs will participate -- PEg 
through PE.. 


With every execution of the array events 
modeled by SFTs ty and t7, PE, through PE3 always 
participate. However if the problem context is 
changed to require the alternative of using PE, 
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through PE7 in a different iteration of the com- 
putation, these array events are more aptly 


modeled by MFTs, i.e., 
D(t>) = [MFT; K(M,) € {<14, ot, <ot, 14>} 
D(t7) = [MFT; KM) « {<14, of <ot, 153] 


For analysis, an assumption would then be made 
about the order of selection of the masks, i.e., 
<14, 04>, then <04, 14>. ‘This capability for 
alternation is readily distinguishable on the 
graph, contributing more detail for analysis. 


More exposition of the modeling capability of 
S Nets, particularly the properties of concur- 
rency and conflict, is supplied in [10] as are 
additional (and less simple) examples. SIMD 
algorithms can be modeled with Petri Nets, but 
with increased modeling complexity. S Nets are 
distinct from Petri Nets in the notions of vector- 
mask places and mask firing transitions. Many 
Petri Net places are created in lieu of a single 
vector-mask place. It requires many Petri Net 
transitions in forward conflict to model the more 
concise mask firing transition [10]. 


The richer detail of S Nets is illustrated by 
modeling Flynn's classes of architectures shown 
in Figure 11. Because of the availability of 
vector-mask places in addition to scalar places, 
the multiplicity of the data stream can now be 
depicted; also scalar and vector activity can be 
distinguished. The SIMD architecture is readily 
distinguished from SISD by the added detail of 
vector-mask places. The MIMD-2 architecture which 
allows both scalar and vector concurrency, is 
distinguishable from the MIMD-3 architecture which 
allows concurrent SIMD resources (MSIMD). Both 
are clearly different from the MIMD-1 architecture 
of conventional distributed processors. 


SUMMARY 


In this paper, Synchronous Nets,or S Nets, 
have been formally defined. S Nets are a modifi- 
cation of Petri Nets, specifically developed to 
provide richer detail for modeling the SIMD en- 
vironment. It is possible to model SIMD computa- 
tions with Petri Nets but at the expense of in- 
creased modeling complexity. However, the rela- 
tionship of S Nets and Petri Nets is explored 
elsewhere [10]. 
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ABSTRACT -- This paper is concerned with minimiz- 
ing the delay due to data communication during the 
execution of a parallel algorithm on an SIMD com- 
puter with a two-way circular unit-shift intercon- 
nection network. Algorithms are developed which 
determine, for a given parallel algorithm, the 
order of computation of a parallel arithmetic ex- 
pression, the alignment of operands for every bi- 
nary operation, and the mapping and remapping of 
data into physical memories so that the commmnica- 
tion cost is minimized. The proposed techniques 


are applicable to array variables with special 
types of index functions. 
1. INTRODUCTION 
The total execution time of a parallel 


algorithm on a multiprocessor system can be broken 
down into the actual computation time and the time 
of interprocessor data communication. On an SIMD 
(Single-Instruction stream, Multiple-Data stream) 
machine, the computation time is usually dependent 
only on the number of processing elements, and the 
communication time is dependent on several factors 
such as the interconnection 
storage scheme for data. Previous analyses 
[1,2,8] have shown that the data communication can 
be a major cause of degradation of the performance 
of the algorithm. In this paper, we study the 
minimization of the communication cost for a class 
of parallel algorithms and interconnection 
networks, which will be described shortly. 


A restricted version of this minimzation 


problem is considered in [11], where it is called 
the mapping problem, and the emphasis is on the 
effect of data storage schemes on the 


communication time. Some further exploitation of 
the problem is reported in [6]. In section 2, the 
minimization problem is formally defined. In 
section 3, we provide algorithms for reordering of 
computation in. order to minimize communication 
delay. In section 4, we provide algorithms first 


for static mapping of data and then ffor data. 


remapping for further improvement of communication 
delay. Proofs for the optimality o these 
algorithms are to be found in [5]. | 


2. THE MINIMIZATION PROBLEM 

Since this minimization problem deals with both 
software (algorithms) and hardware’ (interconnec- 
tion network) of the SIMD system, a few parameters 
from both parts are needed for formally defining 
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N denotes the 
and physical 


the problem [6]. In what follows, 
number of both logical memories 
memories. These parameters are: 


i. The index set {0,1,2,...,N-I}m of the 
logical memories m of the algorithm. Each 
component a; of a vector A is assumed to be stored 
in logical memory m;. 


ii. A sequence of logical data transfers that 
satisfies the communication requirement of the 
algorithm. Each logical transfer py, is a partial 
function, for the alignment of data u and v. It 
maps the index set {0,1,...N-I}m into itself. 


iii. The index set {0,1,...,N-I}m of the 
physical. memories. : 

iv. The set Q of interconnection functions 
A FP Pa 18 that defines the interconnection 
network. Each qj; is a bijection on the index set 
{0,1,...,N-l}m, and designates the transfer that 
can be physically effected in one routing step. 

v. The distance function D, associated with 
the interconnection network, on the set of all 
partial functions on {0,1,...,N-I}qy. The value of 
this function on a partial function is the minimum 
number of routing steps needed by the network to 
realize that partial function. 


vi. A mapping, or, storage 
vector v of the algorithm is a 
{0,1,...,N-l3m onto {0,1....,N-T} yy. 


scheme, Fy for 
bijection from 


Let Fu and Fy be the mapping functions for 
variables u and v. respectively, and let Ppyy be a 
logical transfer for aligning u and v. Suppose pyy 
aligns u and v by moving u to v. Then at the 
logical level, pyy corresponds to moving the ele- 
ment u (Fo? (i)) from logical memory Met) to 


" BuyFuttiy At the physical level, however, this 


data movement is from memory Mi, 
is stored, to -M oT ee 
FyPuvFG Ci) 


such an alignment operation is GEE Piya) 


where u (Fo? (i) ) 
Hence the cost of 


Instead of aligning the variable u with the 
variable v, or v with u for a= binary’ operation 
involving u and v, suppose the variables are 
aligned somewhere else. For example, suppose’ the 
binary operation A(i + k,;) + B(i + kg) is 
performed in the processor (i + t) and the result 
stored in the memory Mjst. There is no loss of 


generality if the resulting vector is named 
W(i + t). In other words, the mapping function Fy 
is an identity function. The total cost of 


communication in this case is the sum of the costs 
of aligning u with w and aligning v with w. This 
sum is given by D(py, Fy?) + D (yw Fv “1). For the 
above example, we have pay(i) (i +(t -k,)) and 
(i) = (i + (t-k,)). (the arithmetic an the 
indices in this paper ee all modulo N.) 
The above result can be easily generalized to 
a parallel expression S. Let v,,v,...,Vy, be the 
variables in S and let E denote an expression tree 
for S. Let w, ,Wz...,we be the internal nodes of E 
where each w;j represents the partial result of 
some binary operation on variables and/or other 
internal nodes. For a given expression tree E, 
which specifies the order of computation of S, the 
alignment of operands for every binary operation 
must be determined from which the logical transfer 
functions can be specified. Thus, for a statement 
S, the communication cost is 


, Dw fy) , } Pw) 


Vi: variable Wi swis internal 

(leaf node) nodes 
W;? partial wj=PARENT (w;,) 

result (inter- 

nal node) 
w) =PARENT (v; ) 
where PARENT (v; ) denotes the parent node of Vj in 
tree E. 

Notice that for a given statement and a given 

interconnection network, the total cost depends on 
the mappings Fy. 's and the transfer functions 


‘san Ss j 
Pui w; d Pw wi Ss The transfer functions used 
in the cost calculation depend on the expression 


tree itself (for an expression involving commuta- 
tive operatjons the expression tree is not unique) 
as well as on the alignments of operands. 


The complete minimization 
complete algorithm can 


problem for a 
therefore be stated as: 


follows: 
Given a parallel algorithm with 
{0,1,...,N-I}m, an interconnection network with 


{0,1,...,N-I}y, and D, determine the following: 


1. an expression tree, or, a computation ordering, 
for every parallel assignment statement, 


2. alignments of operands for binary operations, 


3. a mapping function Fy, for every variable v, 


such that 
“| 
» _ a | Pyiwjty;)* » Par) (2) 
state- V;: variable Ww; Wjtinternal 
ments of (leaf node) nodes 


given al- wy! partial 
gorithm result (inter- 
nal node) 
w; =PARENT (v; ) 
is minimized. 


Wj =PARENT (w;) 
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The general minimization problem is very 
complex, and we shall restrict ourselves to the 
following environment. The SIMD machine ‘is 
assumed to have aie two-way circular unit-shift 
network on N processing elements (PEs), where, for 
notational simplicity, N is assumed to be an even 
integer, i.e., for this network, Q={q,,q,} and 
q, (i)=(Ci+1), q, (i)=(i-1). For convenience, this 
network will be 2 -eferred to in what follows as the 
circular network. Also assumed is the fact’ that 
the index functions of the array variables of the 
algorithms are all of the form: i + constant. In 
other words, we assume that the logical transfers 
for the variables are all of the uniform-shift 
type, i.e., for some variables u and V; 
Puy (i) = (itk) mod N, where k is an_ integer 


constant. If an algorithm uses both uniform-shift 
type and other types of transfers, then we shall | 
consider the minimization problem for’ those 


varialbes which are involved in only uniform-shift 
type transfers. Special types of data permuta- 
tions, such as perfect shuffle permutation and bit 
reversal in FFT, do exist. However, more often 
than not, the variables are of the type considered 
in this paper. From this and the fact’ that 
practically every interconnection network contains 
the circular connection, the techniques developed 
here are widely applicable to many problems. 


In this section, we shall develop 
for determining, for a given expression, a 
sequence of logical data transfers that satisfies 
the communication requirement of an algorithm with 
minimum cost. it is assumed that the mapping Fy 
for every varjiable v is an identity mapping. Thus 
the cost fuhction (1) for a parallel expression 


techniques 


becomes 
Vv. 3: variable wy switinternal 
(leaf node) nodes 
Wit partial wj =PARENT (w;) 
result (inter- 
nal node) 
wj =PARENT (vj) 
To minimize this cost, therefore, the 


following parameters must be determined: 


Ff. alignment of operands for every binary 
operation in a parallel assignment statement (PAS) , 
and 

ti. an ordering of computation of a PAS. 


We shall assume that a compiler is available 
which generates for PAS's, as intermediate code, a 


sequence of three-address triples (see [3], for 
example) on which our minimization is performed. 
A typical PAS, for example, has the _ following 
form: 
A(i) = B(i+2) - C(i-3) * D(i+]) (O<i <N-1) 

Our algorithm will determine, for example, where 
the operations * and - should take place such that 
the communication cost of evaluating this PAS is 
minimal. The logical data transfers, for example, 


p and Py 
colt then be easily specified. 
Some basic definitions are needed. 


Definition 1. Let A(it+k). be a variable term in 


some parallel assignment statement, where k is an 
integer constant and «= - (N/2-1) SkSN/2. Then the 
displacement associated with this occurrence of 
variable A is k. 

On a circular network of N PEs, the 


displacement k then refers to the PE that contains 


the first element, A(k), of the vector A(i+k). 
The displacement associated with the target 
variable of a PAS is assumed to be O, _ for 
convenience. lf it is not originally zero, an 
adjustment can be made to the indices of the 


variables in the PAS such that it becomes zero. 
We shall use disp(v) to denote the displacement 
associated with the operand v. 


Definition 2. The alignment point (AP): For a bi- 
nary operation on two operands A(it+k,;) and 
B(itky), for all i, O < i $ N-l, if the yoperation 
on the (i+k, )* element of A qnd (i+k,)* element 
of B takes place in the (i+aP)®® PE, then AP is 


the alignment point of that operation. 


This value of AP is then the displacement 
associated with the partial result. The AP of 
such an operation node n in an expression tree E 


is denoted by AP(n). 


Example 1. For PAS A(i)=B(it2)-C(i-3)*D(i+1), if 
AP for the multiplication is 0, then, elements of C 
must be moved from i PE to (143)"9 PE and those 
PE to (i-1)§ 


of D from i PE. 

Let k, and kg be two displacements. Then ky 
and k 2 are the boundary of two intervals on the 
circular network. Let I(k;,k2) denote the 
interval with shorter length. Then it is easy to 
see that for a single operation such as 


A(it+k,)+B(itk,), the communication cost is minimal 
and equal to |I (k,,k,)| if AP € I (ky ,k,), where 
|! (k,,ky) | denotes the length of I (k, sky). If, 
however, this operation is a node in some expres- 
sion tree E containing more than one operation, 
then the following result is more general. 


Theorem 1. Let A(it+k,) + B(i+k,) be an operation 
in an expression tree E which is not the root of 
E, and let x denote the alignment point for this 
operation. Then a@ necessary condition for the 
communication cost of evaluating E to be minimal 
is 
x € I tk, yk) if 3] 1 (ky kg) |<N, 

or x & I (k,,k,) U I (k,-t,k,+t) otherwise, 


where it is assumed that k,<k , and 
Outline of Proof. What needs to be shown is_ that 


if x is not in the above interval, then we can 
always find another alignment point x', which lies 
in the interval, such that the cost of aligning A 
and B at ~x' plus the cost of moving the partial 
result from x' to x is less than the cost of 


aligning A and B at x. @ 


,, where W denotes the partial result, 
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- Theorem 1 simplifies the minimization process 
greatly because the alignment points are now 
confined to lie in a small interval. Also, in 
real problems the displacements are usually smal]. 
compared with N, i.e., the value 3|1I (k,;,k,)| is 
less than N in most cases. We shall further 


assume in what follows that 3|1I(k,,k2)|<N for 
every pair of displacements k, and ky). The 
necessary condition in’ this case thus becomes 
AP(n) € [k,,k,] for every operation node n. (The 


treatment 


[5].) 


of the general problem can be found in 


Based on Theorem 1] and the above assumption, 


we are now able to further reduce the cost  func- 
tion. Suppose 0, and 0, are the operands of some 
‘binary operation which may be either variables or 
partial results. .Let w denote the result of this 
operation and let k, = disp(o,) and ky, = 
disp(o,). Then since AP(w) € [k,,k,], on a 
circular network, the communication cost is 


D (Py wy) + D (Pow) = [k, - ky |> and the cost func- 
tion for a PAS becomes 


>» JAP(R) - AP(L)| + [AP(root of E)|, (3) 


R,Lsbrother 
nodes in —E 
where the term AP{root of £) is needed if E 
corresponds” to a PAS; and what is_ to be 
determined is the alignment point for every 
internal node of E. 

This problem is solved using the _ following 


algorithm, where, for convenience, we define the 
alignment point of a leaf node n with displace- 
ment k to be AP(n) = k, and write AP(n) € [k,k]. 


Algorithm 1 


/* Given E and disp(v) for every variable v ine, 
this algorithm determines AP(w) for every internal 
node w such that (3) is minimized. */ 
Traverse the expression tree E in postorder 
At each node z, | 
/* Basic step */ 
if 2 is a leaf node with disp (z) =k, 
j.e., AP(z) € [k,k] 
then AP(z) = k; 
/* Recursion step */ 
if z is an internal node with child nodes x and 
y and AP(x) € [x1,x2] and AP(y) € [yl,y2], 
and W = [x1,x2] N[yl,y2] 
then 
casel. W= ¢ 
/* AP(z) is in’ the 
the two intervals */ 
AP (x) x2, AP(y) = yl, and 
AP(z) € [x2,y1]] if x2<y] 
AP (x) = xl, AP(y) = y2, and 
| AP(z) € [y2,x1] if y2<x1 
case2. W # ¢ 
subcasel. |W|=1, i.e., W={w} where 
w=x]l=y2 or w=x2=y 1 
AP (x) = AP(y) = AP(z) = w3 


interval "between! 


subcase2. |W| >] 
AP (x) = AP(y) = AP(z) E W; 
if z is the root of E and AP(z) € [21,22], 


then AP (z) t where t € [z1,22] and the 


absolute value of t is minimum. 


End 


f 


Algorithm 1 


Example 2. We apply Algorithm 1 to the following 
PAS 
A(i) = B(i) + B(ith) - C(it+3) - D(i+3). 


“The alignment points obtained using Algorithm 1 


are shown in Figure 1(a). The communication cost 
is 4 + 3 = 7. Compared with cost = 10 in Figure 
1(b}), which is obtained using a compiler without 


such optimization, this shows a 30% improvement.«# 


Now note that in (3), the structure of the 


O.A(i) 


O A(i) 
cost = 7 


C(it3) 


@) ; 
B(i) B(it4) 


(b) 


B(i) B(it+4)- 
(a) 


Figure 1. Alignment points for a PAS. 
expression tree, or the order of computation, is 
to be preserved. This restriction is sometimes ne- 
cessary, especially in problems where computation 
reordering may cause loss of significant digits, 
overflow, or underflow, etc. However, if computa- 
tion reordering is permitted, then further im- 
provement in the communication cost may be pos- 
sible. This is discussed below. 


We shall consider computation reordering based 
on the commutativity and associativity of binary 
operators. In other words, we shal! perform compu- 


tation reordering only on subexpressions in’ which 
all operations have the same precedence. 

Let E denote the expression tree for a 
parallel assignment statement. The precedence of 
an internal node of €E is the precedence of the 
operator associated with that node. 

Definition 3. A subtree T of E is called a p_sub- 


tree if all the internal nodes of T have the 
precedence and 


same 
every leaf node of T is either a 


variable, or an internal node of E which has a 
different precedence. 
The precedence of a = p_subtree is the 


precedence of the internal nodes of the p_subtree. 


Definition 4. A maximal p_subtree of E is a p_sub- 


tree T of E such that the root of T is either the 
root of E or the child of a node of different 
precedence. 


Thus a subexpression that corresponds to some 
maximal p_subtree of E is the basic unit for the 
consideration of computation reordering. 


Example 3. In Figure 2, T, and Tz are maximal 
p_subtrees of the same precedence. T, and T, are 
also maximal p_subtrees of the same precedence. a 


ZI] 


Figure 2. An expression tree and its p_subtrees. 


For any expression tree E, the reordering pro- 
cess consists of two phases, the analysis and the 
synthesis. In the analysis phase, the following 
tasks are performed: 


1. Identify all maximal p subtrees T in E. 


2. For each T, determine the interval in 
which the optimal alignment point for the root of 
T, after reordering, must lie. 


3. 


Fix the AP for every node of E. 
Lh. Determine the effective operation to be 
applied to each variable in E (see below). 


The maximal p_subtrees of E can be identified 
by traversing E in the order of levels. During the 
traversal, if the node being visited has a 
precedence different from those already visited, 
then it is the root of some new maximal p_subtree, 
which is yet to be identified. For example, the 
"divide'’ node in Figure 2 has precedence different 


from that of addition, and its the root of 
p_subtree 1 ,. This can be done using a recursive 
routine, which, in addition to identifying the 


maximal p_subtrees of E, shall also determine the 
effective operation to be applied to each operand 
on a maximal p_subtree. For example, for a subex- 
pression A(i) (B(i+2) - C(i-1)), the effective 
operations for A, B, and C are, respectively, +, 
» and -. 

After a maximal p_subtree T is identified, 
the next task is to determine for T the new order 
of computation for which the communication cost is 
minimal. This is done by first determining the 
alignment point for the root of T. Suppose dj, 


d,,...,d, are the displacements for the operands 
VioVaree9Ves respectively, in  T and = d,2d,2 
-..2d,. Then it can be shown [5] that the align-. 


ment point for the root, AP(root of T) must lie in 
the interval [d,,d,] to ensure that the communi- 
cation cost is minimal. This continues until 
AP (root of E) is obtained. Tree E will then be 
traversed again for fixing the exact value of the 
AP for the root of every maximal p_subtree T of E. 


In the synthesis phase, new expressions are. 
generated for every maximal p_subtree T of E. Let 
AP(root of E) = h € [d,,d,], then the new order of 
computation can be easily obtained. Following is 


the rule for obtaining the new computation order 
for which the communication cost is always equal 
to |d, - d,]. 
Case 1. d, = h. The new expression is 
((... (v, Ov, )®...) Ov,) » 
where @ denotes the effective operation. 
The AP's of these operations are, respec- 
tively, d,,-2,d,. . 
d, = h. The new expression is 
((... (Vv, Ov_-;) @) ...) Ov, ) . 
The AP's are diy > Gpangs e+ dy. 
Case 3. dy, < h<d,. Let t be such that 
dee; <h Sd. 
The new expression is 
(... (v, Ov, )@...Ov_) © ((... (VY Ov, )..-) Ov44;) - 
The AP's for the left half are 
d4,d3,..,d¢, and for the right half 
Dy nprcee dees and the AP for the root of 
the new expression tree is h. 


Case 2. 


As an 
reordered as 
A(i) = ((B(i+4) - C(i+3)) - D(i+3)) + BCI), 
for which the cost is only 4, 40% of the original 
cost and almost half of the cost obtained from 

using only Algorithm 1. 


example, the PAS of Example 2 can be 


Having determined the alignment point for 
every operation in an expression, one can easily 
specify the logical transfer functions that should 
be applied to the operands, and the sequence of 
logical transfers for the entire algorithm is 
obtained. fn the next section, this sequence wil] 
be the data for obtaining the optimal mapping and 
remapping for the variables in the algorithm. 


4, DATA MAPPING AND REMAPPING 


are 


In [11], two kinds of paralTe?t algorithms are 


considered for applying data mapping. In one, 
algorithm is assumed to have only one logical 
transfer. For example, the matrix transposition 
may be realized using only the perfect shuffle 
permutation. For this type of algorithms, the 
optimal mapping for the data can easily be 
obtained [11, Theorem 1]. In the second kind of 
algorithms, the logical transfers of a given 
variable are not all the same; for example, 


bitonic sort on a mesh-connected network uses 
several different transfers. The minimization 
problem for this type of algorithms is usually 
difficult. A technique used in [11] for the 
bitonic sort is to determine an optimal mapping 
for the most expensive transfer, which is 


certainly not always the heast expensive mapping. 


In this section, we shall first study the 
minimization problem for the second kind of 
algorithms with the restriction that logical 
transfers are all uniform shifts but may have 
different shift distances, and that the mapping is 
static. In sectiom 4.2, we discuss how to improve 
the cost further by allowing remapping of data 
during the execution of the algorithm. 


the. 


the: 
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memory Fj (k+AP(i)), will now be moved 


4.1 Static Mapping 


The problem is the following: given a parallel 
algorithm and a circular network, determine for 
each variable of the algorithm a static mapping. 
such that the communication cost of operations 
involving these variables is minimal. The logical. 
transfers are assumed to have been obtained using 
the techniques of the previous section. 


First formulate the cost function. Let F; be 
the mapping function for a variable v; and F; (k) = 
(k + Xi), where O S k SN-1 and x; is to be 
determined. The cost of evaluating a PAS can be 
derived as follows. Let £ be the expression 
tree. Then, from section 3, the cost at the 
logical level is simply 


> JAP(R) - AP(L)| + |AP(root of E) | 


R,L: brother 
nodes in — 


lf non-identity mapping is allowed for variables, 
then the cost of any alignment operation involving 


a variable becomes dependent on the mapping 
function for the variable. For partial results, 
however, the cost is a function of only their 


associated alignment points. 


The cost of an alignment operation involving 
a variable is illustrated in Figure 3. The 
elements of v,, v; (k+AP(i)), originally stored in 


to memory 
(k+AP (PARENT (i))) where the partial result shall 
reside. So the cost of this operation is 


[Fs (k+AP(i)) - (k+AP (PARENT (i))) | 
= |x; - (AP(PARENT(i)) - AP(i)) |. 


PARENT (i) AP (PARENT (i) ) 


Vi 
AP (i) =disp (i) 


Figure 3. An alignment operation. 


all 
Let 


To determine the optimal mapping for v;, 
such terms involving x; have to be collected. 


r; denote the total number of logical transfers 
involving v;, and let dj, denote the value 
AP(PARENT(i)) -AP(i) in the k*” transfer of v;. 
Then the total cost due to variable v; can be 
written as 
| aa 
' 


which is minimized by setting 
x, = median of {diz | Te k < ri}. 


Example 4. Suppose the displacements associated 


with variable v; in an algorithm are 


AP(i) = disp(i): 2 0-1 0-1 
and the alignment points for’ the 
involving Vv; are 


operations 


AP (PARENT (i)) : O 2 | <b 25 
then the shift distances for Vv; are 
and the median of id. J is 2. 


Then the cost using the mapping F; (k) = (k + 2) is 
6, while the cost using the identity mapping would 
be 10, indicating a 40% improvement. s 


lt is easy to see that the improvement would 
be high if the sequence of transfers is long and 
the median is far from zero. If the median is 
equal to zero, then the mapping is simply the 
identity. It is also easy to note that if the 
sequence is long, perhaps one can perform several 
remapping during the execution of the algorithm so 
that the communication cost is further reduced. 
This data remapping is discussed below. 


4.2 Remapping 


In [10], remapping techniques are applied to 
a class of parallel algorithms which are assumed 
to be executed on an SIMD machine using a shuffle- 
shift interconnection network. Most of these 
algorithms have logrithmic computation but higher 
communication complexity. After remapping, many of 
the algorithms are balanced. This analysis again 
indicates that software techniques are a more 
flexible tool for providing better solution to the 
improvement of the performance of parallel 
algorithms. In this section, we study the 
remapping problem under the assumed environment. 


In general, the major task in the use of data 
remapping is the determination of when and how to 
perform a remapping so that the new cost, the com- 
munication cost plus the cost of remapping itself, 
is less than the cost of using only. static 
mapping. 

In what follows, we shall use L to denote the 
sequence of all logical transfers of a variable v 
that are required by the given parallel algorithm, 
i-@., L=d,, dg,...,d, where d; is the shift 
distance for v in the ith transfer of v. A 
remapping schedule for v is defined as a division 
of the sequence L into subsequences, and within a 
subsequence only a static mapping function for v 
is used, and a remapping of the data must be 
performed before a new subsequence which uses a 
different mapping function starts. Clearly, the 
mapping function for v within a subsequence should 
be the optimal static mapping for v with respect 
to this subsequence, which can be obtained using 
the result of section 4.1. A remapping schedule 
for v is said to be optimal if the total cost with 
respect to this schedule is the least among all 
possible schedules. 


Let the interval defined by two consecutive 
distances d;, dj,, of L be denoted by |;, 
H(d;,di,,)- Let Jl, denote the 


shift 


vtes bi = 


intersection of all intervals |;, 
itty daeGs6 
|L|-1 


i? 
i=] 


where [L | denotes the number of elements in L. 


Following are some of the 
sequence L for which | # gd. 


properties of a 


Lemma _1. Let |, = [a,b]. 
median of L, is given by 


lf a < b, then dup the 


a i f d,; = a and d,s a, 


x, where x € [a,b], otherwise. . 


if I, = [a,a], then it is easy to see that dy 
= a if |L| is odd, and, if |L| is even, both dy 
and a must lie in the interval defined by the two 
median numbers of L. In the latter case, we shall 
set dy = a, so that in all cases dy € |, . 


In the following Lemmas, C (L,dy) denotes the 
communication cost resulting from the static 


mapping Fy(i) = (i + dy) for v with respect to L, 
Leas 
k 
j= ] 
Let L' denote any subsequence of kL having s 


elements. Let cy, denote the median of L'. 


Lemma 2. Suppose |, #6 and dy has been fixed 
usign Lemma 1. If s is even, then it is possible 
to choose c,, such that 

E (tise, = C(t yay) 


and if s is odd, then 


Clitsd PCG a dees oe 


Lemma 3. Let L; and Lj,, be two consecutive sub- 
sequences of L, where [L5 J=s; and |Lj4,[=Sj+,- Let 


. * . j 
their medians be . denoted d- and qd?" » respec- 


tively. Let Lt denote the concatenation of L5 and 
Ljape If ty = [a,b] # ¢, then 
Case i. Sy» Sie both even. 
J jet 
din and din can be chosen such that 
Jj Jth ' 

C(Ljsdy) + ClLi yy sda, ) = CCL; ody). 

Case ij. sj odd, Sit even. 


i+f 
ai" can be chosen such that 


| ae J+ “ae } 
ClbjrdytC (Lip odin )#/dy ~ dial = CCL; ody). 


Case iti. s:, Ss: 


j jet both odd. 
j i 1+ + 
lf - da (a) 2 dy)» then ave! > aie. : 
< dy) - And 
ia ae Sis. 
in Lemma 4, Id, - q? J and a a are the 


cost of remapping. "his ™ emma shows that if IL #6, 
‘then no gain will be achieved by dividing L_ into 
two subsequences. A generalization of this Lemma 
is very useful and is given below. 


Theorem 2. If "7 #4, then no remapping schedule 
Gf for v will result in a communication cost less 


than C(L,d,) . a 


Theorem 2 implies that, as a first step, one 
should determine a sequence of subsequences of L, 
i.@., determine P='L,,L,,..-,Le, such that every 
Lj is maximal in the sense that Ls #¢ and the 
inclusion of either two elements of L that precede 
L; (i # 1) or the two elements that follow L; (i # 
t) will result in IL = g. This can be achieved 
using Algorithm 2 below. 


Algorithm 2 


shift 
F_X(1:k), L_X(1I:k) store 
start and end indices of subsequences. 
L_MEET(1:k), H_MEET(1:k) low and high bounds of 
intersection interval for L;. Q.X: index for the 


/* given sequence of logical transfers (or 
distances) in L(i:k). 


number of the subsequence Lj [ml,m2]: current 
intersection interval. [t],t2] = [L(i), L(it1)]: 
next interval to be processed. */ 


<2 then stop; /* no remapping necessary */ 


if k 

QX= 1; /* for Ly */ 

FX(1) = 1; /* start with dy */ 

ml = min(L(1),L(2)); /*dj = L(i) */ 
m2 = max(L(1),L(2)); /* [ml,m2] */ 


ji = Bh /* next interval is [L(2),L(3)] */ . 
Flag = '1'B; /* until [ml,m2] = ¢-*/ | 
do while (i < k); /* L(2), ae »L(k) */ 
tl = min(L(i),L(it1)); t2 = max(l (bas 
/* test if intersection is empty */ 
call CHK_MEET (m1,m2,t1,t2); 
if “Flag then do; /*end of this sebsequence */ 
L_X(Q_X) = i - 1; /*index of last item */ 
L_MEET(Q_X) = ml; /*intersection interval */ 
H_ MEET(Q_X) = m2; 
QX=QX + 1; /* next subsequence */ 
F_X (OX) = i; /* start of Lia) */ 
2 ; 
ml=min(m2,L(i)); /* new starting [m1,m2] */ 
m2 = max(j,L(i)); 


Flag = '1'B; 
end; /* of if then do; */ 
else i = i + 1; /* just get next two items */ 


end; /* of do while */ 
/* for last subsequence */ 
L_X(Q_X) = is; L_MEET(Q_X) = ml; H_MEET(Q_X) = 


CHK_MEET: Procedure (m1,m2,t1,t2); 

/* compute intersection of [ml,m2] & [t1,t2] */ 
/* set Flag to 'O'B if it is empty */ 

if tl>m2 [| t2<mi 


fix the 
dp=-1. Then 


/* of this subsequence */ — 


then do; Flag = '0O'B; 
if tl = m2 then do; ml tl; return; end; 
if t2 = ml then do; m2 = t2; return; end; 
ml = max(ml,tl); m2 = min(m2,t2); 

end CHK_MEET; 


return; end; 


End of Algorithm 2 2 


After obtaining Gusing Algorithm 2, the next 
step is to fix the median of every subsequence 
L,. If [L; | is odd, then diis simply ,the median 
of L;; but if |L;| is even, then d, can be any 
value in the interval defined by the two median 
elements of Lz. For convenience, we shall cal] 
such an interval the median interval. Following is 
the rule for determining the median of every L; 
such that the resulting remapping schedule is 
optimal. | 


Median Selection Rule 


Given L = d,, d,, 

+, Ly using Algorithm 2. 
i. if [Lj] = odd, then ai = median of L;. 

ii. if {Lj | = even and if "the median interval of 

L; is of the form [w,w], then d, = w. 
iii. for every ig SS. ty 

if diy is not fixed by rule i or ii and 
a € [lo,hi], then 


mdie lo if dit! < to 


. itt 
d= hi if dy > hi 


2b, obtain P= Lis Las 


End of Median selection Rule 


(Note that median intervals of any 
consecutive subsequences never overlap.) 


Example 5. Suppose 


for some variable v, L = 1, 3, 


4k, 5, 6, 1, -2, -1, 0. Applying Algorithm 2 to L, 
we get P= t,, La, Lz, L4, Ly , where L,= 1,3, 
Loa= 4, Lg= 5,6, La= 1, Le= -2,-1,0. We can first 


following medians: dji=4, dy=1, and 
using the Median Setection Rule, we 
obatin the medians: d.=3, dy=5. It is easy to 
compute the following communication costs: 


C(L,d,) = C(L,1) = 20, and 


Sct, is y ia ae 


= (240) +0+ (0+1) +04 (14041) ] + [1+1+4+2] = 


Thus, with remapping, the cost due to variable v 
is further improved by more than 30%. a 


Algorithm 2, along with the Median Selection 
Rule, generates an optimal remapping schedule for. 
any sequence L. The optimality proof can be found 


in [5]. 


Theorem 3. Algorithm 2 and Median Selection Rule 

constitute an optimal algorithm for generating the 
remapping schedule for any logical transfer 
sequence. «= 


As another more practical example, consider 
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the Jacobi algorithm for computing the eigenvalues 
of real symmetric matrices [13]. In the classical 
Jacobi algorithm, a real symmetric matrix is 
reduced to the diagonal form by a sequence of 
elementary orthogonal transformations. In [9], the 


algorithm is modified for parallel computation on 
llliac IV machine. In this implementation, the 
data broadcasting capability of the network is an 


essential requirement. 
communication cost would become very high and 
outweigh the gain obtained from parallel 
computation. In [4], the matrix multiplication by 
diagonal scheme [12] is used to compute’ the 
transformation, which requires no data 
broadcasting, so the algorithm can be’ implemented 
on a circular network. A further modification to 
the algorithm is given in [7], which also uses the 
multiplication by diagonal scheme. In both [4] and 
[7], the transformation has the form gAgt, where 
A= [a;,] is the given NxN matrix and ¢ = Ct ; ;J is 
the transformation matrix 
aj,ier ANd aj4,,7, | even, to be eliminated. From 
the multiplication scheme, it is easy to see that 
for both t;; and te.) i odd, the sequence of 


Otherwise, = the data 


° 
1 9 


logical data transfers required to complete every 
computation of A is L = 0,0,1,2,3,...,N-2. 
Therefore, we can apply our algorithm to L and 


obtain the following remapping schedule: 


L 


In other words, these two data vectors are 
remapped one PE down the circular network every 
computation step (except for the first and last 


(0,0) 9 (1) ’ (2) ’ (3) goers (N-3,N-2) ° 


subsequences). The communication cost using this 
remapping is clearly O(N) per transformation, 
while if only static mapping is used, the cost 


would be O(N2). 


For this particualr example, although the 
above remapping schedule may have been obtained by 
carefully examining the multiplication scheme, or 
by using the data buffering technique of [10], 
with our method and Theorem 3, the optimality of 
the schedule is guaranteed. 


5. CONCLUSION 


The problem of minimizing the communication 
cost in the implementation of a parallel algorithm 
on an SIMD computer is discussed. For a= given 
parallel algorithm, techniques have been developed 
for determining the order of computation of an 
expression, the alignment of operands for every 
binary operation, the mapping of data _ to the 
physical memories, and data remapping such that 
the communication time is minimized. This 
analysis, as well as other previous work, indicate 
that software techniques are sometimes a more 
‘flexible tool for providing better to 
the improvement of per formance parallel 
algorithms than hardware, which is limited 
by the cost and complexity. 


solution 
of 
of ten 


which causes elements. 
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PARALLEL HASHING HARDWARE 


FOR 


TEXT SCANNING APPLICATIONS 


F, J. BURKOWSKI 


DEPARTMENT OF COMPUTER SCIENCE 
UNIVERSITY OF MANITOBA 
WINNIPEG, MANITOBA, CANADA 


ABSTRACT 


This paper discusses the hardware design of 
a term detection unit which is to be used in the 
scanning of text emanating from a serial source 
such as disk or bubble memory. This unit will 
provide the highly parallel activity necessary 
for the detection of any one of many terms (eg. 
1024 terms) while accepting source text at a 


transfer rate typical of disk technology, for 
example, one megabyte per second. The design 
incorporates a hardware~based hashing scheme 


which allows the incoming text to be compared 
with selected terms in a RAM which contains all 
of the strings to be detected. Since the speed 
of operation is such that any lengthy probe se- 
quence cannot be tolerated, the design involves 
mechanisms which strive to provide a perfect 
hash; perfect in the sense that a probe sequence 
need not be used to overcome the collision prob- 
lem. 


INTRODUCTION 


In many text retrieval systems the data 
base is comprised of a huge collection of docu- 
ments which are essentially unstructured in 
their organization. Indexing facilities may be 
rather limited in scope and hence not sufficient 


to meet the requirements of queries which incor- 


porate arbitrary terms or keywords. Examples of 
such data bases include newspaper repositories, 
legal decisions, journal articles, military in- 
telligence reports, and even smaller data bases 
for corporate offices (see [1]). Since an in- 
dexing procedure cannot effectively narrow down 
the area to be searched, the retrieval strategy 
may involve ‘a serial scan of the entire data 
base or some significant portion of it. — 

Typically, user enquiries are analyzed by 
query translator software which extracts from 
the user queries all of the various terms which 
must be detected in the source text. For exam- 
ple, a typical query might be the concise ver- 
sion of a command such as: "Retrieve all docu- 
ments which contain any 3 of the following 6 
terms: BRASS, TRUMPET, SACKBUT, CORNETT, 
SERPENT, TENOR CURTAL". The query translator 
passes a list of such terms to the term detec~ 
tion unit which. will scan the source text and 
determine the "document location" of each term 
if it exists in the text. This information is 
passed to query resolution software which will 
determine whether the scanned document meets the 
criteria imposed by any user query. 
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If a system is subject to enquiries from 
many users then a typical scan may involve a 
parallel search for hundreds of terms. This 
search is to be accomplished while the source 
text streams by at a rate determined by the disk 
transfer rate of approximately one megabyte per 
second. This type of search is best done by a 
term detection unit which is specifically de- 
signed for such highly parallel activity. This 
paper will discuss the hardware design of such a 
term detection unit. By using a cost effective 
and novel approach, the unit is capable of han-~ 
dling a great many terms. If the reader is in- 
terested in other approaches he may consult [2] 
or [3] wherein the author discusses the problems 
associated with these applications and presents 
a survey of some of the architectures which are 
of current interest. 


TERM DETECTION HARDWARE 


The term detector accepts a serial stream 
of characters which are shifted into a_ serial~ 
in parallel-out shift register SR capable of 
holding k characters. Each shift operation 
causes the input sequence to be shifted one 
character position, The parallel-out lines of 
SR provide input to a set of comparators which 
are also fed by the data lines of a RAM with an 
organization of M=2**m words each 8*k bits wide, 
In a text retrieval system, one might expect 
values such as k=32 and M=256. 

Using this approach a k character substring 
of the source text can be compared with any one 
of M words in the RAM. Each byte of RAM would 
contain a 7 bit ASCII code and an additional bit 
used to signify a "don't care" or unconditional 
match character. ; 

Since we must effectively compare all the M 
terms with the current source substring of k 
characters in the time interval between shifts, 
it is tempting to use an associative memory in 
place of the RAM and comparator, but considering 
the current prices of associative memory such a 
solution is prohibitively expensive. Conse- 
quently, we retain the RAM and use in conjunc~ 
tion with it a "mapping module" which accepts a 
subset X of the data-out bits from the shift 
register and computes H(X) an mbit address 
which will select from the RAM one term which is 
to be compared with the source text in the shift 
register (see Fig. 1). The mapping module will 
be a high speed random access memory with ad-~ 
dress inputs determined by the X-lines and 
data~out lines providing the m~bit H(X) value. 


We now describe the strategy involved. If 
the shift register does not contain a required 
term then a comparison will be initiated with 
some selected term in the RAM (from a practical 
point of view, it does not matter which term is 
chosen) and since a match is not detected the 
shift activity simply resumes. If the shift 
register contains a target term which matches 
some term in the RAM, then we must ensure that 
the RAM address of the matching term is equal to 
the value H(X) generated by the mapping module 
when the target term in the shift register is in 
alignment with this selected term in the RAM. 
This will give an active level on the equality 
output of the comparator when the selected term 
is compared with the contents of the shift reg- 
ister. The generation of the appropriate H(X) 
value is relatively easy to derive unless the 
number of terms is very large compared to the 
number of X~lines. Note that a uniqueness prop- 
erty must be enforced. If we consider any tar- 
get term so situated in the shift register that 
it is aligned with its matching term in the RAM, 
the binary value produced on the X~lines must be 
unique i.e. different from the X~-line value of 
any other target term in its aligned position. 
In effect, H(X) must generate a perfect (no col~ 
lision) hash function for the set of all target 
terms in their aligned positions. The crucial 
point of the design is that the attainment of 
such a perfect hash is greatly enhanced by the 
fact that there is often some freedom of choice 
in selecting the X-value which is generated by 
the aligned target term. This is true since 
most terms will not be the full k characters in 
length and consequently a term can usually adopt 
one of many positions relative to the character 
positions responsible for the definition of the 
X~lines. Note that the appearance of a term 
within its word of RAM is constrained to a posi-~ 
tion such that the aligned target must cover 
those positions of the shift register which de~- 
fine the X~lines. 

In order to illustrate the above discussion 
with an example, consider the following simpli- 
fied situation: Suppose a search is being done 
for the following four terms, "MARS", "MARTIAN", 
"STAR", "ARTIST". Also, suppose for the sake of 
illustration, that there are 14 X—lines defined 
as the middle two characters in the shift regis~ 
ter. In actual practice, it is likely that we 
would use fewer lines defined by more charac~ 
ters, say the middle four character positions. 
With these assumptions the X-line outputs for 
"MARS" would be "MA", "AR" or "RS". The given 
terms and their corresponding X-value candidates 
may be portrayed by a bipartite graph as illus- 
trated in fig. 2. If it is possible to define 
for this graph a maximal matching which covers 
all the points in the "term set" then it is pos~ 
sible to find the distinct X-line values that 
will be needed. In fig. 2 one of the many maxi-~ 
mal matchings is represented by the heavy lines. 
The algorithm which derives such a maximal 
matching can be found in [4] (the Hungarian 
method) and a faster version has been recently 
presented in [5]. Note that the graph illus- 
trates a certain amount of competition for par- 
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ticular X~-values such as "AR", As one might 
expect, the overall competition for distinct X-~ 
values will be augmented if the number of terms 
is increased and/or the number of X-lines is de- 
creased. Using the algorithms presented in [4] 
or [5] a maximal matching can be extracted rap- 
idly since the graphs that arise in text scan- 
ning are rather easy to match. If the maximal 
matching does not completely cover the term set 
then the uncovered terms are rejected from the 
batch and must wait for the batch being accumu- 
lated for the next scan of the data base, 

Simulation studies have been carried out in 
an effort to predict the severity of this rejec~ 
tion problem. The following table presents the 
results: 


Maximum 
Number 
Rejects 


Minimum 
Number 
Rejects 


Average 
Number 
Rejects 


RAM Implementation M=256 100 trials 


In this table h represents the number of 
X-lines taken from the n middle characters of 
the shift register. In each case 100 trials 
were run with each trial involving a batch of 
256 terms. For example, when 12 X-lines were 
defined using the four low order bits from each 
of the three character positions at the middle 
of the shift register then for the 100 trials 82 
ran with 0 rejection (perfect hashing for all 
terms) while the other 18 required rejections 
but not more than 3 terms were ever rejected. 
The average number of rejections was 0.26 term. 
Terms were randomly selected and varied in 
length between 4 and 32 characters. The lengths 
formed a normal distribution with an average of 
8.5. 3 


MINIMIZING THE REJECTION PROBLEM 


In [6] simulation studies indicate that the 
ASCII encoding of the source text is not neces~ 
sarily the best choice. 

Since the previous design works with 
order bits of each character, the characters 
sponsible for defining the X~-lines may have 
same. appearance since their X~line outputs are 
identical. For example, the ASCII codes for the 
four vowels "A", "E", "I", and "U" all end with 
the bits "01" and hence, if the X-lines use only 
two low order bits these frequently occurring 


low 
re~ 
the 


characters are not distinguished from one 
another. The selection algorithms for the pre- 
vious design would have a better performance (on 
the average, fewer rejections per batch) if the 
binary codes representing the characters were 
distributed in such a way that the codes for 
characters used most frequently (in English lan~- 
guage source text) presented the most variety in 
their low order bits. It can be demonstrated 
that for certain values of n and h the average 
rejection rate can be cut in half by using this 
more suitable encoding of the source text. 

The above strategies are sufficient for 


term detection units which handle 256 or fewer. 


terms. Unfortunately, if n and h are kept fix- 
ed, the rejection rate does not rise in a linear 
fashion as the number of terms increases. For 
example, a unit designed to handle 1024 terms 
with h=12 and n=4 must. suffer an average of 
85.58 term rejections. This many rejections 
could adversely affect many users on such a sys- 
tem. Various design techniques may be used in 
an effort to minimize term rejection, The most 
obvious approach is to increase h the number of 
X~lines. However, since each additional bit 
doubles the amount of RAM used in the mapping 
module it may be more profitable to investigate 
other approaches to the problem. 

Perhaps the most effective alternative is a 
design which tends to alleviate the constraint 
imposed by the uniqueness requirement discussed 
earlier. In fig. 3 and fig. 4 designs are pre- 
sented which allow two terms to share a common 
X-value. In the first case an extra field in 
the string RAM is used as a pointer to a second 
term which may also be compared with the con- 
tents of the shift register. Thus, an extra 
probe is done in a sequential fashion. Note 
that in the time between shifts (one microse- 
cond) the following delay times are experienced: 
the delay in the shift register (40 ns.), the 
access time for the mapping module (100 ns.), 
the access time for two string RAM reads (400 
ns.) and the propagation delay through the com- 
parators (60 ns.). 

In fig. 4 an alternative scheme effectively 
permits two parallel probes to be done simulta~ 
neously. There is extra hardware, consisting of 
two mapping modules and two sets of comparator 
but the performance is faster. 

In both of these schemes 


problem is greatly alleviated as demonstrated by 
the table below which presents the results of 50 
trials each working with 1024 terms. 


Perfect Hashing Extra Probe Allowed 


h n | min:max 


Term Rejections for Batches with 1024 Terms 


the rejection — 
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FUTURE DIRECTIONS . 


The derivation of the required maximal 
matching has been accomplished using graph theo- 
retic algorithms. which are guaranteed to produce 
results in all 'cases even though the general 
problem has a very high "worst case" complexity. 
However, text scanning graphs are easy to match 
and hence it would be well worthwhile to look 
for alternate algorithms in an effort to satisfy 
other objectives more important to the applica~ 
tion, for example: 7 . 


Since a user is effectively rejected if any 
of his or her terms are rejected, it is more 
important to have an algorithm which rejects 
the minimum number of users. 


l. 


In the design discussed above, all user. 
terms are collected and presented in a batch 
to the term detector just prior to a com- 
plete scan of the data base, which may take 
hours in some applications. Response time 
for a user would be improved if his or her 
terms could be presented to the term detec- 
tion unit very soon after query translation, 
perhaps when the disk head moves from the 
current track to the next track, This dy- 
namic addition and deletion of user terms 
would have a significant impact on the de- 
sign of the algorithm used to extract the 
matching. 
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A PARALLEL PROCESSOR ELECTRONIC TARGET SIGNAL GENERATOR 
FOR ELECTRO-OPTICAL SEEKERS 


T. N. Long, J- T. Randolph, and M. J. Sinclair 
Engineering Experiment Station 
Georgia Institute of Technology 
Atlanta, Georgia 30332 


Summary 


The first phase of implementation of a 
flexible, programmable, computer controlled 
Electronic Target Signal Generator (ETSG) has 
been completed. This scene generator will be 
incorporated into hybrid closed-loop simulations 
performed at the United States Army Missile 
Command's Advanced Simulation Center in 
Huntsville, Alabama. Using parallel processing 
techniques, the ETSG can now generate real-time 
signals equivalent to the two spectral channel 
detector outputs of a Passive Optical Seeker 
Technique (POST) seeker such as Stinger/POST. A 
future phase of implementation will permit simu- 
lation of reticle type seekers such as Redeye or 
Stinger. A total of six targets within the 
field of view can now be simulated. Expansion 
capabilities allow for a total of nineteen (19) 
targets. 


The Electronic Target Signal Generator 
(ETSG) is a self-contained parallel processing 
computer which, when given the proper initial 
and dynamic inputs, generates an analog voltage 
that simulates the detector output or outputs of 
an electro-optical seeker. Simulated sources of 
specified shape, size, spatial orientation, and 
intensity gradients can be created and 
controlled within the missile's field of view. 
Simulation of a target/background/countermeasure 


scenario can then be accomplished by selecting 


various sources and combining them to represent 
the various parts of the target signature. A 


high level block diagram of the system is 
illustrated in Figure l. 
During initialization, constants which 


describe a particular seeker type and desired 
source properties are entered through a color 
graphics terminal. Using these inputs (Table 
1), a reference Random Access Memory (RAM) block 
is loaded for each source to avoid unnecessary 
calculations during real-time source updates. 
This loading is performed by an initialization 


processor which is,.a commercially available 
Motorola EXORcisor that contains a 6800 
micro-processor, 48K-bytes of RAM, dual floppy 


disks, and a serial interface. Source reference 
RAM's are 64 x 64 blocks of eight-bit bipolar 
memory which contain the highest resolution 
“map” that the source can have during a mission. 
Presently, two of the sources may be designated 
as “complex”, with three orthogonal views being 
stored in RAM. This configuration, which can be 
expanded to include all sources, permits fly- 
around during a simulation. : 
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For each real-time update of an engagement, 
a high-speed data transfer is initiated by the 
digital computer which dictates source spatial 
orientations. Upon completion of the transfer 
to the ETSG, source central processing units 
(CPUs) are halted and loaded with their respec- 
tive dynamic data (Table 1). These CPUs (one 
per source) are also built around the 6800 
microprocessor and use a 16 x 16-bit multiplier 
to reduce computation time. 


INITIALIZATION PARAMETERS 


SEEKER SOURCES 
Type (POST or reticle) Shape 
Field of View Size Size 


Blur Size 

NEFD 

SNR for Tracking 

Scan Rate (reticle only) 


Aspect Ratio 
Intensity Gradient 
Spectral Band 
Intensity Polarity 
Intensity Program 
Maximum Range 
Minimum Range 


FLARE 
Intensity vs Time 


ENVIRONMENT PULSE JAMMER 
Background Repetition Rate 
Atmospheric Attenuation Sweep Time 

| Duty Cycle 
Period 


DYNAMIC PARAMETERS 


Block Transfer Code Word Aspect 

Range Rotation 
Azimuth Flare and Jammer 
Elevation Control 


Table 1. Simulation Parameters 

Each 64 x 64—byte reference RAM represents a 
source at the range where it exactly fills the 
field of view in at least one dimension. Also, 
the RAM represents the source viewed with no 
projection. Accordingly, a source with a nom 
zero aspect cosine will be smaller than its 
reference. Because of these principles, any 
source can be created simply by skipping points 
in reference RAM as it is loaded into a memory 
array which represents the image plane. The 
source CPUs calculate these skip factors. 


Upon completion of the parallel calculations 
of the CPUs for each update, loader circuitry 
begins loading one of two image-plane “maps” 
(64 x 64-byte blocks of twelve-bit ‘ bipolar 
memory) in each channel. The maps are loaded. 
from the reference RAMs using the values calcu- 
lated and stored in the latches by the source 


CPUs. As the sources are loaded into an image- 
plane map, the intensities from the reference 
“RAMs are contrasted with the background and 
scaled for range (with atmospheric effects 
included) by a floating-point multiplication. 


As sources are being loaded into one of the 
image-plane maps of a channel, points previously 
loaded in the other map of that channel are 
being summed for output in a method which 
depends on the type of seeker being modeled. In 
the POST (flying spot) system, a relatively 
small window (3 x 3, 2 x 2, or 1 x 1 bytes) is 
scanned in a pattern through the field of view. 
The intensities of the points which fall within 
the window are summed and output as a single 
time sample during the sequential sampling. The 
functions of the two image-plane maps in each 
channel are swapped for each update so that one 
map is always being filled while the other is 
being scanned and sampled. 


After the points have been summed, _ the 
background level and detector noise are added 
with the result being converted to an analog 
signal and filtered to remove sampling noise. 
This analog signal represents simulated detector 
output. 


Off-line diagnostics are performed by the 
initialization processor. A significant problem 
is encountered when using a 64K-byte machine to 
address over 200K bytes of memory. This problem 
is solved in- the ETSG by defining memory loca- 
tions DOOO through DFFF as a 4kK-byte window to 
the rest of. the system. The initialization pro- 
cessor first writes an extended address word to 
a latch which is decoded by each memory module 
in the system. Being able to address all memory 
provides an inherent diagnostic capability which 
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Input 
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can be realized with software to eventually 


develop a completely self-diagnostic machine. 


When the ETSG begins dynamic processing, an 
independent dedicated CPU takes control of the 
display. This “display” CPU, which is also a 
6800-based design, has three functions: 1) pro- 
cess incoming source coordinates with missile 
roll angle. to de-roll coordinates, 2) display 
all sources at the de-rolled coordinates with a 
different color for each source, and 3) perform 
dynamic error checking on source ranges and 
dynamic roll rate. . 


Present configuration of the ETSG provides 
for two three~dimensional sources and four, two- 
dimensional sources. Future phases of implemen- 
tation will provide for nineteen’ three- 
dimensional sources. Present ETSG configuration 
provides rectilinear scanning of the field of 
view. Future phases of implementation will pro- 
vide for reticle scanning which will require an 
additional section of parallel processing due to 
the increased number of bytes that must be con- 
volved from the muaeepheue map to generate 
detector output. | 


References 


(1) G. E. Riley and M. J. Sinclair, Electronic 
Analysis, Engineering Experiment Station, 
Georgia Institute of Technology, Final 
Report No. A-2186, May 1980, 54 pp. 


(2) C. E. Barnett and T. N. Long, Integration of 
a Hybrid Simulation for a Small Air Defense 
Missile, Engineering Experiment Station, 
Georgia Institute of Technology, Final 
Report No. A~2333, April 1980, 23 pp. 


| Image 
Plane 
Map A 
Sead Scan 
" D-A 
mage Channel 1 
Plane Detector 
Map B Output 


Scan 
‘Processor 


Image Channel 2 
Plane | Detector 
Map B Output 


Figure 1. ETSG Block Diagram 


288 


DESIGN OF A MIXED VOICE/DATA COMPUTER NETWORK 
FOR PACKET-SWITCHING COMMUNICA TION | 


Jien-Deng Kao, Jin-Tuu Wang, Te-Son Kuo, Ger-Chih Chow 
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SUMMA RY 


In a four-wire telephone system, the channel utilization 
is activated by voice only 38% in one direction during conver- 
sation interval. It is therefore possible to use the remaining 
capacity of silent period for inserting and extracting data in 
packets by using a microcomputer-based controller called 
Mixed Voice and Data Processor (MVDP), which is one of the 
nodes of MVD network (MVDNET), that plays as a value added 
service of existing voice-grade telephone network. 

The voice signal remains its analog form as well as its 
real-time property and is assigned with higher priority than 
data traffic, which can besstored-and-forwarded by MVDP in 
MVDNET. 

The protocol, which is based on the ISO defined open 
system interconnection and CCITT recommendation X.25, is 
adopted to cope with the recently growing up public packet- 
switching data communication. 

This paper describes the design consideration of packet 
switching MVDNET, laying emphasis on details of special 
hardware components, adaptive routing algorithm and hiera- 
chical flow control structure. 

I, INTRODUCTION 

Wang and Liu have proposed a queueing model of a Mixed 
Voice and Data (MVD) transmission system (1). The concept 
arose on the basis of low efficiency of the sampled measur- 
ment of the activity records in satellite circuits for telephone 
conversation, in which a channel is activated in one direction 
for only 38.8% of the time when it is busy (2) . Thus the rest 
of the silent periods can be used for data communication. 

Some analysis and simulation work have been done for the 
proposed system (3), (4). An improved hardware realization 
of Mixed Voice and Data Processor (MVDP) has been imple- 
mented by G.C. Chow (5). . 

This paper describes the considerations in the design of 
a Mixed Voice and Data Network (MVDNET), which consists of 
MVDP nodes, as the communication processors to provide the 
value added service for the existing telephone network. 

In the MVDNET, the data flow in an MVDP node is kept in 
a store-and-forward operation, while the voice remains in its 
real-time nature with higher priority to occupy the channels 
than data packets. In this way, voice subscribers does not 
know the existence of data packets and is not disturbed by data 
packet either. The interrupted data packets then wait for the 
next silent period or find another available channel by most 
recently realeased (MRR) rule to reduce the chance of being 
interrupted again. The methods to process the preempted data 
packet may be preemtive-repeat or preemtive-resume. 


Il. MVDNET Configuration 


‘The data traffic following through the MVDNET is 
generated by data terminal equipments (DTEs). The sources 
and sinks of information (sending and receiving terminals) are 
attached to the MVDPs via local transmission lines (local 
loops), as shown in Fig.1. 

In each MVDP, the process which decides to which 
following MVDPs or DTEs should be sent is called the switch 
module. The switch module makes the routing decision based 
on information it has on the status of the network. This routing 
information may be divided into a local and a global part. The 
local part is readily available at the current MVDP, whereas 
the global part has to be collected from all the other nodes. 
Collection of status information and preparation and distribu- 
tion of the global routing information. is performed by partial 
centralized adaptive routing algorithm. 


Il. Hardware Description , 
We first summarize the important features of the MVDP 


0190-3918/81/0000/0289$00.75 © 1981 IEEE 


a oa ee ome eT 

. 2 NINE. 0a, A 

Links : [ pSTQRAGE HK: : 

rom : ‘ Links to 
Neighbor 4? >>} QIK > =¢ Neighbor 
MVDPs : 7 Loss, ° 


MVDP. 
Mex 
y 


V 
toy;) . ES char oF 
Sending Routing sfafus KB 


Terminal 6)}—aT qos}, Lables 1 Receiving 
5 in@ f Secordary) HC +H) Terminais 
e ie kee cies ABLE Nes 
TLB -—---7 11] een) 
| TLB 
t_—_.y MVDP 


Console 
Legends: Y: transit arrival rate from adj acent MVDPs 


A: source arrival arrival rate from local DTEs 
do: service rate of dispatcher (CPU) 

AL; Mody: link output rates 

Aius1 2 MVDP to local DTEs output rate 


Fig.1 MODEL OF A STORE-AND-FORWARD MVDP 


design: 

(1) It keeps existing voice-grade (VG) telephone network in 
analog form and considers voice signals as_ real-time 
through traffic having higher priority than data packets; 

(2) It retransmits the data packet (preemtive-repeat mode) or 
residual part (preemtive-resume mode) whenever an inter- 
ference occurred due to the arrival of voice signals. 

(3) It uses voice detectors to sense the channel activity status 
and to interrupt the MVDP whenever there is a change of 
status (voice arrival or termination). 

(4) It uses data recognizers to distinguish data from voice at 
the receiving side, therefore, they can be switched 
accordingly by an analog switch that feeds the data signal 
to modem. 

(5) It uses multiple voice-grade channel to achieve higher data 
throughput and provide flexible choice of data-packet 


routing. 
Among the hardware components, the important devices 


different from other existing data networks are voice detector, 
data recognizer and modified Modem. We will describe the 
feature of such dévices in the following sections. 


3-1 Voice Detector 

The voice detector is a very important device and is 
difficult to be implemented due to its sensitivity. Its noise 
rejection figures are directly related to the utilization of 
voice-grade channels. 

Fig. 2, shows the block diagram of the voice detector. 
The Zero Crossing Rate (ZC R) detector is based on important 
criteria to determine voice starting point from noise condition. 
Usually, an unvoiced sound has high zero crossing rate and 
low signal level, while white noise has both low ZCR and 
signal level (73. A low level detector is used to decrease the 
voice dector activity caused by the background noice with high 
frequency and very low level. The Level Detector/Comparator 
is a substitution of the rms detector to avoid the 10 ms delay 
time of rms detector, and it has higher rejection of impulse 
noise than rms detector. 


Zero Crossing 


Voice Rate Detector oe 
Detector retriggerable 


one shot 
Level Detector 
{ZComparator 


Fig. 2 Voice Detector Block Diagram 


3-2 Data Recognizer 

The purpose of this data recognizer is to distinguish data 
from voice signal by means of "signal pattern recognition 
method'' according to the frequency spectrum and amplitude 
distribution of voice and data carrier. 


There are four criteria to recognize data signal from | 


unregular analog voice signal: 

(1) Data carrier is one frequency component of voice signal 
which may trigger a normal voice detector. 

(2) Data signal is transmitted in high level amplitude and with 
‘much more energy than most of voice signals. 

(3) The pure data carrier is a signal tone signal. Its hamonic 
distortion should be less than 5 percent. 

(4) The frequency components outside the carrier bandwidth, 
during the transmission of the pure carrier are 20 db to 
40 db lower than inband signal. This criterion is the 
major role to recognize the existence of data signal. Fig.3 
shows the implementation of Data Recognizer using the 
above mentioned four criteria. 


3-3 Modified Modem 

To achieve higher data rate, the modem is designed using 
the 4 phase PSK technique. An advantage of PSK modem is its 
ultra-low error rate. . 


V. Buffer Management Policy 


Buffer management policy in subnet of MVDNET is based 
on two rules: | 
(1) Output queue length limit. 
(2) "Transit" traffic is assigned a higher priority than "New" 
traffic. 


Two-priority scheme.is proposed whereby transit traffic 
(consists of packets that have traveled over longer one ormore 
hops) has a higher priority than new traffic as regards to the 
buffer allocation. This is done by setting a limit on the number 
of occupied buffers, beyond which new traffic is rejected and 
transit traffic may be accepted. Let | 
B = the total number of buffers in an MVDP. 
L =a threshold allocated value. 
N = the number of outgoing links ina MVDP and bmax be 
the maximun length of an output queue (where b max > 
B/N). 

M=the number of allocated buffers, nj of which are 
utilized by link i, upon arrival of a packet route to 
link i. . 


Under the constraints @)0<ni<bmx ;(b) J nj SB, 
i 
the following decision rule is applied: 


(1) I LXM<B and nj < bmx » than accept thetransit 
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Fig. 3 Data Recognizer 


IV. Communication protocols for MVDNET 


Most important among the functions of the communication 
protocols are error, flow and congestion control, and routing 
strategies. In the MVDNET, there is a protocol hierarchy, as 


shown in Table 1, its structure is based on the seven levels — 


open system interconnection defined by ISO (8). For the sake 
of public data packet transfer application in existing analog 
telephone networks, Level 1 is CCITT X.21 bis (9), while 
Levels 2 and 3 are the corresponding levels of CCITT 
X.25 (6). Transport level is supposed to ensure reliable 
source DTE-to-sink DTE sequenced delivery; one candidate is 
TSS25 (10). The Session Layer deals with the establishment 
of logical relationships between application entities, the main- 
tenance of this relationship, and the handling of dialogue. The 
Presentation Level deals with data structuring. It contains a 
set of protocols which depend on the type of data to be ex- 
changed. The Application layer deals with the user interface 
functions which include all service facilities such as security, 
multi-address working, service type, service offering, etc. 


Table 1. MVDNET Protocol Hierarchy 


Protocol level _ 
7. Application Level _ 


6. Presentation Level 
Logical relationships between 


5. Session Layer 
: application entities, etc. 


4, Transport Service TSS 25, DTE-~TO-DTE control. 


The user/service interface. 


Data formatting features. 


3. Network Access Permanent or switched virtual 

; circuit —— 
Asynchronous Balance Mode 
of HDLC . 


CCITT X. 21 bis. 


2, Link Level 


1. Physical Level 
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MVDNET i.e. path routing level and channel routing level 
under selected link. Fig. 4 shows the routing levels. The path 
routing level in MVDNET uses a database describing the net- 
work to generate a tree representing the minimum delay paths 
from a given root MVDP to every other network MVDPs. The 
VG channel routing level of MVDNET selects a silent VG 
channel from the provided several Voice-grade (VG)channels 


between MVDPs. 


Path Routing Level 


Voice-grade 
Channel Routing Level 


‘40: 


N 


oe Heo Go) [ore 


Fig.4 Routing levels in MVDNET 


6-1 Path routing level 
The adaptive routing procedures(11)for path 
level of MVDNET are described as following: 


routing 


Routing computation—The First Two Shortest Paths (FTSP) 

Algorithm 

(1) The basic FTSP algorithm uses a database describing 
the network to generate two tree representing the first 
two minimum delay paths from a given root node to 
every other network node. 

(2) To implement the primary and secondary tables. 
according to the first and secondary minimum path 
trees respectively. | 

Routes in MVDNET are assigned on a_ single-path-per 
Virtual Circuit (VC) basis. The assignment of a route to a 
switched virtual circuit is established temporarily or 
permanently. 


6.2 VGchannel routing level 

Between hop levels (MVDP-to-MVDP), several four-wire 

voice grade channel are provided to improve the network 

throughput and lower the delay time through the MVDP hop. 

In order to embed digital data packets into the silent of 

VG channel, we present the voice detectors at transmitting side 

to monitoring the VG channel status (talking or silent period) 

(1) The interrupt events of voice arrival or terminating are 
recorded into the VG channel status table. 

(2) The channel number, in which the voice is terminating, is 
push into the most recently released (MRR) stack. 

(3) If any packet is to be sent, then it takes a VG channel 
number from MRR stack, 

(4) After the packet has been sent, then check the VG channel 
is still in silent state or not? If true, then push the 
number of VG channel into MRR stack. 


VII. Flow Control Policy 


In MVDNET, the flow control is designed in hierachical 
multilevel structure, and these levels are actually embedded 
into corresponding levels of protocols. 


7-1 Link level flow control 

Link level flow control in MVDNET is carried out by the 
HDLC protocol (6,8) , in asynchronous balanced mode (ABM). 
It operates in a local way that it monitors local queues and 
buffer occupancies at each node and rejects transit (from 
adjacent MVDPs) traffic at the node when some predefined 
thresholds (e.g. maximum queue limits) are exceeds. 

A physical network path is set up for each user session 
and is released when the session is terminated. Sequeneing 
and error control are provided at each step along the path. In 
addition, it permits the application of selective flow control to 
each individual VC stream. Packet buffers in MVDP are 
dynamically allocated to VC's based on demand (complete 
sharing), but thresholds are set on individual VC allocations as 
well as on overall buffer pool utilization. 

7-2 Entry-to-exist Level flow control 

In MVDNET, in which a fixed route is assigned to each 
user session during setup time, the entry-to-exit flow control 
is applied individually on each virtual circuit. 

When the source DTE transmitting rate exceeds the sink 
receiving rate, the flow control mechanism intervenes to slow 
down inputs from the source DTE into the entry MVDP. The 
window size W must be large enough to permit each virtual 
circuit to efficiently utilize the bandwidth available on thepath. 
7-3 Network access flow control 

The network access scheme in MVDNET is similar to 
input buffer limit proposed by Lam (12). The different point 
is that in the former case an new input packet from external 
souce DTE is discarded if the total number of the packets in 
the entry MVDP exceeds a given threshold, while in the Lam's 
scheme an input packet is discarded when the number of input 
packets exceeds a given threshold. Transit packets can freely 
claim all the buffers. It is clear that the network access 
control in MVDNET prevent congestion by a transit 
traffic over input traffic. 


7.4 Transport level flow control 

The transport flow control is based on a window mecha- 
nism as in LL, ETE levels. Namely, the receiver grants 
transmission credits to the sender as soon as reassembly 
buffers become free. Upon receiving a credit, the sender is 
authorized to transmit a message of an agreed-upon length. 
When reassembly buffer become full, no credits are returned 
to the sender, thus temporarily stopping message transmis- 
sions. 


Vif. Conclusion 


In this paper we have proposed a mixed voice and data 
communication network (MVDNET) as a value added serviceof 
existing VG telephone network by using the MVD processors 
(MVDP) as packet switching nodes. 
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be stored-and-forwarded by the MVDP. 


The analog voice signal, due to its real-time property, is 
assigned with higher priority than data traffic. The latter can 
Therefore most of 
conventional computer network protocols, buffer management 
policy, adaptive routing algorithm and flow control scheme are 
suited for data transfer in MVDNET. The ISO defined hiera- 
chical structure of protocols is adopted so as to cope with the 
growing-up public packet-switching data communication. 

Routing and flow control procedures have traditionally 
been developed independently in packet networks, however, both 
are brought together into useful cooperation in MVDNET, which 
is a virtual call network, where a path must be selected before 
data transfer on a user connection begins. The routing 
algorithm is invoked first to determine whether primary route 
of sufficient residual bandwidth is available, else test 
secondary route next. If both paths are congested, the virtual 
circuit connection is blocked immediately at the entry node by 
the network access flow control level, thus preventing conges- 
tion rather than allowing it to occur and then attempting to 
recover from it. 

In the future design, the modem using coherent detection 
may be used to obtain higher speed. Also, the CPU set may 
use the bipolar microcomputer with memory that quick access 
time. Thus, the MVDNET will become a high throughput, in- 
expensive and standardized network which provides improved 
efficiency and reduced cost public packet-switching data com- 
munication service from existing VG telephone network. 
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A NEW TYPE OF MIMD- ORGANIZED MULTIPROCESSOR 
HANDLING TWO- STAGE PARALLELISM BY MEANS OF A 
DYNAMICALLY. CONFIGURABLE ARCHITECTURE 


R. Buhrer 


ETH ( Swiss Federal Institute of Technology ) 
Zurich, Switzerland 


Summary 


At ETH, a new simulation package PSCSP (power- 
series continuous simulation program) for the simu- 
lation of continuous systems is currently in the 
final stage of development [1], [2]. The main at- 
tributes of its integration technique can be summa- 
rized as follows: 


When applying the method of power series expan- 
sions to the integration of ordinary differential 
equations, the right-hand sides of the equations 


have to be decomposed into elementary expressions. 


Using a set of library routines (FORMULAS) for 
these expressions, the higher derivatives can be 
evaluated analytically when calling all FORMULAS 
repeatedly with increasing orders of expansion. 
The parallelism has a two- stage nature: several 
independent FORMULAS can be evaluated concurrent-— 
ly (first level of parallelism) while most of 
them feature an internal parallel structure 
(second level of parallelism) that can be ex- 
ploited simultaneously by a number of processors. 


In order to acquire reliable information about 
the maximum gain in speed of a parallel PSCSP com- 
pared with the sequential version, a multiprocessor 
was built whose design and programming is consis- 
tent with the special nature of the encountered 
parallelism. This processor, whose main parts are 
outlined in figure 1, is based primarily on an 
MIMD (multiple- instruction stream - multiple- data 
stream) parallel computer concept [3] improved by a 
new dynamically configurable architecture principle 
[4]. In addition to a supervisor processor, whose 
activities are input/ output, compilation tasks and 
supervision of FORMULA- executions, a set of 16 
execute processors (EP), loaded with identical 
software, are used for execution of the FORMULAS. 
Due to the individual activities of these EPs 
within the same or different FORMULAS, the execu- 
tion of every FORMULA necessitates the creation of 
cooperating EP- groups whose members are distri- 
buted arbitrarily. This is done by the job control 
unit which generally serves as the EP- dispatching 
logic. In order to handle the numerous data trans- 
fers between cooperating processors in an appro- 
priate way, a new interconnection memory intercom 
was developed, whose great advantage is the fact 
that a result provided by any of the EPs is made 
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immediately available to all other processors; i.e 
between any two processors data can be exchanged 
simultaneously without any delay. This intercom 
consists of a quadratic organized memory- matrix 
c= (cij) (i,j = 1,...17) whereby processor k 

(1 < k < 17) duplicates its data into all Ck 

(j = 1,...,17) elements of its associated row. 
Reading is possible in all cj, (i = 1,...,17) ele- 
ments of its associated column. It can be shown 
that possible data protection problems can be eli- 
minated if - in a restricted sense - data are only 
duplicated into the matrix elements of those pro- 
cessors which are working on the same FORMULA. 


Whenever the job control unit starts a group 
of EPs, a configuration information is transmitted 
to the intercom interfaces of each EP of that 
group, allowing a transformation of job- specific 
address modes as "relative left (right)- or abso- 
lute within a group" into physical counterparts by 
means of hardware. Additional logics such as the 


result transfer unit (for direct- memory- access 


result transfers from the EP region to the super- 
visor processor region of the intercom) and a 


serial link for deadstart and maintenance tasks 


complete the system hardware. 

Extensive performance measurements of the 
system (which is fully operational Since the 
spring of this year) proved that the multiproces- 
sor allows an exact determination of the gain in 
speed achieved by the parallel PSCSP. This is due 
to the fact that the system overheads can be sepa- 
sulting from hardware compromises such as the use 
of standard LSI- 11 EPs instead of fast, flexible 
microprocessors, or from the use of conventional 
RAM- memories instead of dual- access memory chips 
in the intercom, etc.) and principal overheads 
based on the specific system architecture. In the 
worst case these overheads turn out to be 27% and 
10% respectively in relation to an optimally pro- 
grammed sequential PSCSP- version. As shown in [4], 
present- day hardware technology allows the reali- 
zation of a system whose technological overhead is 
almost zero. . 
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Figure 1 - Hardware of the Multiprocessor 
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PARALLEL PROCESSING IN COMPUTER COMMUN ICAT IONS 
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Summa. ry 


The aim of this paper is to analyze, by simla- 


tion, the advantages of parallel processing in com 


puter communications. In such a study the parame- 
ters of both the protocol and the hardware confi- 
guration characterizing the commnication system 
are taken into consideration. A simulation model 
is presented which is to perform both the analysis 
and the synthesis of the processes in such a com- 
munication system. Finally, using this model, some 
results on the parallelism obtainable inside a 
transport protocol are shown. 


Int roduction 


With the grand development of distributed pro- 
cessing systems, considerable efforts have been 
made in the experimental and theoretical field of 
computer networks in order to obtain reliable and 
correct interaction between users and applications 


through a Communication Device (CD) (fig.1). As is. 


well known, such an interaction arises by means of 
a message exchange following a suitable set of ru- 
les (protocol). Because of modularity and flexibi- 
lity reasons CD offers the users its service bya 
set of protocols which are hierarchically organiz-— 
ed in layers; each protocol being performed by a 
pair of Communication Processes (CPs) generally 
running on remote computers. So besides the proto- 
col at the application layer, other protocols 
at the lower levels exist in modern networks |1| 
(fig.2)% 


Since each CP can perform its functions at the 
same time as the functions performed by all the o- 
ther CPs, we are faced by the problem of determin- 
ing under what conditions. parallel processing can 
be useful in computer conmunications. Two main so- 
lutions are possible to implement such CPs : imple 
mentation inside main frames or minicomputers and 


implementation in a multimicrocomputer environment. 


Usually in the first solution different CPs are im 
plemented in the same machine thus determining a 
sequential processing of the functions performed 
in the above multilayer structure. On the contrary 
the second solution makes the implementation of 
each CP on a single machine possible and at reaso- 
nable costs, thus obtaining a parallel processing 
which uses a multimicrocomputer structure. 


The aim of this paper is to analyze, by simula-,_ 
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tion, the advantages of parallel processing in 
computer communications. In such a study the para 
meters of both the protocol and the hardware -con- 
figuration. characterizing the communication sy S— 
tem are taken into consideration. In particular 


sect.2 discusses the possibility of parallel pro- 


cessing in computer networks. In sect.2 a simla- 
tion model is presented which is to perform both 
the analysis and the synthesis of the CPs in the 
above multilayer structure. Finally sect.4 first 
analyzes how the hardware configurations can in- 
fluence the performance of a transport protocol, 
then it shows some results on the parallelism ob- 
tainable inside such a layer. 


Parallel processing 


As is well known the communication structure 
of the computer networks is organized in a mlti- 
layer hierarchical architecture. Each layer con- 
sists of CPs and offers suitable services concer 
ing the data transfer from one process to another 
of the upper layer. The service offered by each 
layer is obtained by means of messages exchanged 
between the CPs inside the layer depending on 
suitable communication protocols. In such net- 
works, the decomposition is not only applied to 
partition the functions of the network into sub- 
functions to be performed by each layer |1]|, but 
it is also used to study the internal structure 
of the CPs inside the layers |2|. For example a 
CP can be partitioned into three main subpro- 
cesses which manage respectively the information 
exchange with the upper layer (upper interface), 
‘with the lower layer (lower interface) and with 
the remote partner (protocol unit). Of course the 
degree of decomposition depends on the purpose of 
the study. For example in defining and validat— 
ing protocols and services, suitable decomposi- 
tion of the CPs may be useful in simpifying the 
problem. Similarly in implementing the CPs, one 
can apply decomposition techniques within ‘cer- 
tain limits in order to obtain a physical realiza 
tion of such CPs with the desired modularity, fle 
xibility and speed characteristics. This paper 
deals with parallel processing and then we are in 
terested in decomposition techniques and hardware/ 
software architectures which allows us to obtain 
a suitable degree of parallelism. So we take into 
special consideration the implementation of such 
CPs and, possibly, of their parallel subprocesses 
in multimicrocomputer environment. 


The multimicrocomputer structures for network- 
ing have been widely analyzed by the authors in 
preceding papers |3|,|4|. Generally two solutions 
are possible the first consists of a structure 
in which a microboard performs a CP, on the other 
hand in the second different CPs are implemented 
in the same microboard. In addition the micro- 
boards can interact by means of a shared memory 
area which can be obtained either from various me- 
mory chips or a single common memory chip. Of 
course a suitable analytical or simulation study 
is necessary to evaluate in detail the influence 
of the kind of implementation on the commnication 
structure performance (as for example the through- 
put, the delivery delay or the reliability). Due 
to the complexity of such a problem a simlation 
based approach is proposed in the next section. 


A Simulation model 


A computer -network consists of CPs which are va 
riously interconnected through communication chan- 
nels. A CP can be schematized as a device with two 
inputs, two outputs and a set of intemal states 
|2|. The inputs and the outputs of a CP at the la+ 
yer N regard respectively the services offered .by 
the layer N and by the layer N-1. Messages concern 
ing respectively the protocol at the layer N+1 and 
at the layer N.are embodied in such inputs and out 
puts. The set of rules a CP follows in producing 
all its outputs is called global procedure and de- 


pends on all its inputs and/or its internal states. 


The global procedure can be subdivided in three 
main subprocedures : P concerning the message ex- 
change with a remote CP at the same layer, Q and 
U concerning respectively the command exchange 
with the adjacent CPs at the upper and lower lay 
ers. In addition such subprocedures can be parti- 
tioned into two other subprocedures 

- Pr »Qp and Up which are involved in the transmis- 
sion of the messages inside P,Q and U; 

- Pp,Qp and Up which are involved in the reception 
of the message inside P,Q and U. 

Two fundamental problems arise in managing such an 
internal structure the first concerns the envi- 
ronment in which the subprocedures of a CP are im- 
plemented, the second regards the environment in 
which the internal and interface buffers of a CP is 
implemented. Thus we introduce a CP supervisor 
which follows suitable rules in order to solve 
such problems |5|. 


It is easy to understand that the simulation of 
the communication structure is very complicated be 


cause of the numerous interacting CPs. Therefore 


we propose a down-top approach which allows. the 
network simulation, layer by layer,by using a re- 
current structure. Such a structure consists of 
two CPs interacting through a lower level CD. In 
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this way it is possible to characterize the whole 
communication structure of the CD connecting the — 
end-user processes. Such an approach allows us not 
only the analysis but also the synthesis by solv- 
ing the following general problem step by step 

- Given the commnication device CD at the layer 
N-1 (that is for example the capacity, the delive- 
ry delay, the error probabilities etc.) and the in 
put traffic of CD at the layer N; 7 
~ Optimize CD at the layer N (that is minimize the 
delivery delay in interactive systems or maximize 
the capacity in batch systems or both in general 
purpose systems etc.); 

- Over all the possible protocols, interfaces and 
hardware configurations. 

An high interdependence exists between two adja- 
cent layers : the first concerns the non linearity 
of the parameters relative to a CD which depend on 
the working load coming from the upper layer; the 
second arises when the CPs of different layers are 
implemented in the same environment. For this rea- 
son it is necessary to characterize each layer by 
determining the performance depending on the work- 
ing load. In our model starting from the knowledge 
of the delivery delay, the capacity and the error 
probabilities at the first layer, we obtain the 
same parameters for the CD at the second layer and 
the processing rate and memory acces time depend- 
ing on the working load coming from the layer 3and 
so on. The processing rate is used to simulate the 
critical regionsrelative to the CPU while the me- 
mory access time is used to simulate the critical 
regions relative to the data structure. In fact a 
CP runs according with the constraints with the lo 
wer level CD relative to such critical regions. 


Results 


A simulation program written in SIMULA has been 
derived from the above model. The first re 
sults regard an and-to-end layer provided with 
window flow control and recovery implemented at 
the upper level of a link layer. Three hardware 
configurations are examined in this paper as shown 
in fig.3. For each of the configurations, the 
throughput, the delivery delay and the memory occu 
pation depending on the window size are shown in 
fig.4. Similar diagrams can be drawn for all the 
other protocol parameters (as for example the time 
outs and the message repetition nunber in the reco 
very phase). This allows us to optimize such para- 
meters and to choose the best recovery and flow 
control strategies and hardware configuration. Fur 
ther improvments are obtainable with a parallel 
transmission and reception of the protocol as 
shown in fig.5. However a cost analysis must be 
performed in this case to evaluate the advantage 
of such a parallelism. The use of this program is 
now planned to study the structure of a local net. 
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Fig.3 - Various hardware configurations to im- 
plement transport and link processes (i.e. CP3 
and CP2). 
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Abstract 


A multiprocessor machine approach for paral- 
lely executing SIMULA programs has been recently 
proposed by the authors. As the constituent parts 
of the SIMULA programs at execution level are 
"processes", parallelism in the SIMULA machine is 
considered at process level. The aim of this paper 
is to present the problems of processes synchro- 
nization which arise during the parallel evolution 
of processes within the SIMULA multiprocessor ma- 
chine. The issue of mutual exclusion in critical 
sections in such a system is discussed, and, ac- 
cording to the type of interprocess communication, 
corresponding modes of process synchronization 
are described. It is argued that semaphores and 
semaphore-queues rather than monitors are the 
most effective tools for coping with the synchro- 
nization phenomena in the Parallel SIMULA Machine. 


Introduction 


In a recent paper [1] we have presented the 
basic principles concerning a parallel SIMULA 
machine architecture. The SIMULA language was ana- 
lyzed, constructs were proposed for detecting pa- 
rallelism in SIMULA programs and rules were esta- 
blished to permit parallel execution of such pro- 
grams in a multiprocessor environment. 


SIMULA is a specially designed language 
which is particularly offered for describing and 
efficiently simulating large scale systems. These 
systems reveal significant potential parallelisn, 
which although reflected in the SIMULA programs, 
remains unexploited by the implementation of the 
language within a uniprocessor environment. In 
contrast, a parallel SIMULA machine will provide 
the capability of achieving faster processing ra- 
tes since any SIMULA program can proceed in pa- 
rallel within the host multiprocessor system. Pa- 
rallelism in the SIMULA machine is considered 
at process level. 


In this paper we discuss the process synchro- 
nization problems which arise during parallel evo- 
lution of processes within the SIMULA multiproces- 
sor machine. For reasons of presentation we out- 
line the main features of the parallel SIMULA 
scheme which has been presented in [1]. 


The constituent parts of any SIMULA program 


tem variables". System variables include procedu~ 
res, arrays, simple variables, and references. 


The parallel SIMULA scheme is based upon 
a particular structure called the SIMULA Process 
Interaction Structure (SPIS). The SPIS 
caters to the parallel evolution of SIMULA pro- 
cesses by providing suitable information obtained 
both at compile and run-time levels. The SPIS 
employs a recognition mechanism, called the 
"SIMULA Parallel Process Recognizer" (SPPR).This 
recognition mechanism scans the SIMULA source 
program text and produces two kinds of table 
structures, the "System-Variables" -Table (SV- 
Table) and the "Class Templates" (CT's). The SV- 
Table contains all system variables which are ac- 
cessible in the various SIMULA classes, and hence 
are manipulated by their dynamic instances known 
as processes. A CT provides a detailed record of 
actions of each particular class. Within each CT 
there appear such information as the system va- 
riables accessed by the corresponding class, the 
communication commands issued on behalf of this 
class, and the classes possibly affected by these 
commands. A CT also denotes identification of 
classes on a producer/consumer basis, (P/C-classes). 


Since each SIMULA process originates from a parti- 


cular class, the information contained within a 
Class-Template, potentially reflect the process’ 
interaction pattern and the P/C-process classifi- 
cation accordingly. It is important to establish 
the interprocess communication pattern before de- 
ciding parallelism at execution time. To support 
and maintain run-time information the "System 
Sequencing Set" (SQS) has been appropriately ex- 
tended (E-SQS) so as to accomodate additional re- 
cords on top of these used by the sequential SIMU- 
LA environment. 


In order to ensure the correct evolution of 
processes within the multiprocessor environment 
an "Executive Algorithm" has been developed.This 
executive algorithm imposes a proper communication 
link between SPIS and the "Extended Run-time" sys- 
tem (E-RTS) needed to support SIMULA programs 
during their parallel execution. The executive 
algorithm is implemented by means of a "controller 
processor" whose main functions are to dispatch 
processes to processors, to deal with synchroniza- 
tion phenomena and prevent the occurence of dead- 
blocks. 


The structure of the SPIS and the dispatching 
rules that should apply for an efficient allocation 
of SIMULA processes to processors have been already 
presented in [1]. In the following we consider 
process synchronization within the parallel SIMULA 
machine. 


are called "classes" at program definition level, 
and "processes" at execution level. SIMULA processes 
might be disjoint, but generally they are inter- 
active. Process interaction is either affected by 
means of various "communication commands" ((re) — 
activate, wait, hold, passivate, and cancel) issued 
by the individual processes, or by sharing rela-_ 
tionships that extend over global data called "sys- 
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SIMULA Process Synchronization 


An attempt towards intervening within a SI- 
MULA program structure so as to enable it to be © 
executed in a parallel fashion should be focused: 


(i).0on successfully manipulating the pattern of 
possible transfers of control through the consti- 
tuent SIMULA processes, and (ii) on implementing 
proper sharing relationships on system variables 
accessed by these SIMULA processes. This approach 
should be implemented in a well defined manner so 
as to allow deterministic behaviour on behalf of 
the program and preserve integrity of system va- 
riables. 


When SIMULA processes are executed in paral- 
lel the outcome of their actions depends on their 
relative speed of execution. The speed of SIMULA- 
processes that run asynchronously is affected by 
their frequency of interaction. To achieve success— 
ful cooperation there are specific interaction 
points at which SIMULA processes must synchronize 
their actions. Therefore, a SIMULA process should 
be prevented from proceeding beyond certain inter- 
action points that require some activity by other 
processes. 


In the parallel SIMULA scheme there exist 
two types of syncronization problems: critical 
sections and deadlock These problems affect the . 
evolution of processes and are due to the accessing 
of system variables and to the issuing of SIMULA . 
communications commands. A specific procedure of 
the "Executive Algorithm" called "SIMULA synchro- 
nizer" receives information from the SPIS con- 
cerning process interaction points, and applies 
the required solution. 


In a multiprocessor environment, SIMULA pro- 
cesses may refer to, and modify, system variables 
within blocks of statements known as "critical 
sections". In the parallel SIMULA structure, cri- 
tical sections consist of system variables instead 
of physical resources, which is the usual approach 
in the operating systems concept. Applying mutual 
exclusion in critical sections achieves synchroni- 
zation of SIMULA processes, preserves integrity 
and consistency of system variables and quarantees 
program determinacy. Furthermore, deadlocks occur 
when nested critical sections of system variables 
are encountered. As stated by Hansen in [2], 
absolute hierarchical pattern of communication 
established among processes will assure deadlock 
elimination. In the E-SQS of the parallel SIMULA 
scheme, hierarchies are established according to 
the unique event time associated with each process. 
As a consequence, this hierarchical order prevents 
deadlock occurence during the parallel evolution 
of processes. 


Mutual Exclusion in SIMULA Processes 


In this section we consider the necessary 
criteria for an efficient solution of the synchro- 
nization problems, and we investigate the funda- 
mental synchronizing modes which are handled by 
the "SIMULA Synchronizing Procedure" (SSP). The 
well known principles for an efficient solution of 
synchronization problems have been adapted so as 


to suit the parallel SIMULA machine requirements. 
These synchronizing principles are as follows; 
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(a) Only one process is permitted to be inside a 
critical section at any time instance. 


(b) When a process is inside a critical section 
other processes trying to enter this section 
will be delayed. 


(c) The priority rule used to determine which pro- 
cess will enter a critical section is based 
upon the time hierarchy among processes. This 
time hierarchy results from the event times 
associated with each particular process as it 
appears within the E-SQS. The process asso- 
ciated with the least event time will precee 
first. 


The synchronization phenomena evoke runtime 
action that is provided by the SIMULA synchronizing 
procedure of the "Executive" algorithm. This algo- 
rithm deals with two fundamental synchronization 
modes; 

(i) the V-mode, specifically related to the syn-:. 
chronization imposed by the critical sections,and 


(ii) the C-mode, related to the synchronization 
imposed by the SIMULA communication commands with- 
in the critical sections. 


In the following we critically examine syn- 
chronizing cases representative of each synchroni- 
zation mode. All generalizations of these cases 
are solved by employing identical synchronization 
techniques. It is assumed that all of the processes 
being considered have been allocated to, and run 
on, a processor according to the dispatching rules 
introduced in [1]. 


Case (a): Let {Py Lx} ,tp I//PoLx tp 1} denote 


any two SIMULA processes running in parallel,where 
x1 is their common system variable (i.e. a common 


eritical section), and tpy> tps their associated 


event times respectively. Assuming that ty Py po? Pi 


establishes a higher priority of execution over P9. 
Regardless of the relative execution speeds con- 
cerning P; and Pj, Py should suspend its execution 
action prior to entering its section, while Pj is 
allowed to proceed inside its associated critical 
section. When Pj leaves its critical section then 
"SIMULA Synchronizing" procedure (SSP) of the exe- 
cutive will immediately signal the processor asso- 
ciated with P» to resume its execution. 


Case (b): Let two SIMULA processes {Pj [x s 
x2 itp, 1//Polx]»x2,t poli; run in parallel. Fig. re 


shows a possible arrangement of the processes ' "Sys- 
tem variables" x, x9 and their associated critical 
sections. . 

Py is delayed (blocked) at the critical sta- 
tement which contains x9. Process P, continues exe- 
cuting its critical section statements and leaves 
its critical section at x9. At this point P92 resu- 
mes execution. It can be shown that the general 
case of n-processes accessing K-system variables 
can be approached in a similar fashion. 


The synchronization between processes becomes 
more complicated when processes start issuing SIMU- 
LA communication commands from inside their criti- 


cal sections. In such situations a more elaborate 
solution is required. 


Let {PyLx], t.tp,]//Polx1,tp, I}, where ¢ indi- 


cates any possible SIMULA communication command.It | 
is necessary to investigate the nature and the ef- 
fects caused by each individual command. 


It is evident that Py is delayed until Pj 
deals with the issued communication command. Ac- 
cording to the specific command, action is taken 
as follows: 


(a) Hold (T)-command: It advances the event time 


of Pj by T, te. tp “tp, tT. It follows that: 
(i) If tp; <tpo» then P; continues execution 


while Po still remains blocked expecting P, to 
leave its critical section. 


(ii) If ty gs then Py resumes execution 


while P; is delayed until Py permits resumption 
of Pj at the proper time. 


(b) Passivate, Wait (S), Cancel (P,)-commands: 


These commands cause removal of Pj; from inside 
the ESQS with a subsequent loss of its allo- 
cated processor [1]. Therefore, Py resumes exe- 
cution and P, will not be effectively involved 
in the computation unless it is explicitly 
called by some other active process at a later 
Simulation time instance. 


(c) (Re) activate (P3)-commands: Fig.1)shows the 


situation where a process P3 is (re) called 
back into the ESQS. In [1‘] it is stated that 
if a process Py; issues a '"(re) activate" com- 
mand, process P3 is always a Consumer (C) 
-process in relation to Pj which is a producer 
(P)- process. In such a case process P3 is re- 
placing process Py in its processor,because 
processes belonging to the C-set of a P-process 
all run under the same processor.The following 
subcases can emerge: 


(i) Let tp, <tpy<tp or tp < tp, <tp,: in this 


3 1 
case these relations result from any non- 
immediate activation command [3]. As a result 
P; contunues being active, P9 is temporarily 
delayed (blocked) while P3 does not claim P's 
processor as it exists in a suspended state in- 
side the ESQS. P3 will obtain P;'s processor 
only when P, is removed from inside the E-SQS. 
At that instance synchronization problems might 
accrue between P3 and P29. They will naturally 
fall into one of the described cases. 


(ii) Let tps tp, <tp,? this relation is the 


outcome of the execution of any immediate acti- 
vation command. As a result P, still remains 
inside the ESQS but its processor is allocated 
to P3 which commences execution. Pog still re- 
mains blocked until a later instance at which 
P, leaves its critical section, thus permitting 
unblocking of P9. This is expected to happen 
after a possible loss of control on behalf of 
Pz (e.g. P3 will be either terminated or passi- 
vated, or suspended through the appropriate 
command). 
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The C-mode cases which were examined above, 
can also be generalized so as to include processes 
accessing more than one system variable and 
issuing more than one SIMULA communication command 
from statements inside their critical sections. 
All these generalized cases are approached in an 
analogous way. 


Synchronization problems are normally solved 
by implementing various indivisible operations 
on such structures as semaphores [4,5] and moni- 
tors [6,7]. 


There exist several reasons that discourage 
the implementation of monitors in the synchroni- 
zing procedure of the "Executive" in the parallel 
SIMULA machine. Each monitor in the operating 
system theory is implemented to manipulate a phy- 
sical resource according to some scheduling rules. 
In conventional systems the limited number of phy- 
sical resources usually makes possible the choice 
of monitors. On the contrary the high and unpre- 
dictable rate of appearance of system variables 
within SIMULA programs is not encouraging an effi- 
cient implementation based on the use of monitors. 
Therefore semaphores and semaphore-queués ‘are con- 
sidered to be an effective tool to cater to the 
solution of synchronization phenomena within the 
parallel SIMULA Machine. 
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Architecture of the First 
Vector Computer of China 


Gao Qing-Shi Zhang Xiang 
Institute of Computing Technology,Academia Sinica 
Peking, China | 


The first vector computer of China will soon 
be examined with computational test programs. It 
is a large scale, high-speed, pipeline machine. 
Most of devices were designed and produced in 
China, The main units of the system are all made 
of domestic products. Architecture of the system 
was proposed in 1973[1,2]. The designs of archi- 
tecture and system function were completed in 
1974, and were examined by a national meeting in 
1975 [3] - In this system vector registers are ad- 
opted, and the vectors are treated in vertical- 
-horizontal processing fashion (i.e. segment by 
segment fashion). Fig. 1 shows the vector compu- 
ter system organization. The system consists of 
three parts: the vector computer, the peripheral 
computer, and the peripheral devices. 


Various units and their functions of the 
vector computer are as follows; 

1. ALU. It is a pipelined executing unit. It 
performs various scalar and vector operations. 
For most of the operations, it can accept a set 
of new operands each clock period, meanwhile 
issue one computation result. ) 

2, Re--Vector Registers. There are twelve of 
them: Ro~Ry. Every Ri has 16 elements of 64 bits 
each. They supply the ALU with vector operands 
and temporarily store the intermediate computa-— 
tion results at high speed. They are the basic 
tools that vector operation can be performed in 
segment by segment fashion. “HiOR}> Rk is a vec- 
tor instruction, ® represents one of the arith- 
metic-logic operations, Rj and Fj supply the ALU 
with two operand-vectors, the result—vector of 
operation will be stored into Ry. 

Any Ry can be used as a scalar register when 
it is not being used as a vector register. In 
this case only a particular element of Kj is 
used as the scalar register Rj. 

3. S--High-speed Scalar Memory. It has 32 wor- 
ds of 64 bits each: S9~53; In scalar program, § 
serves as the backup of scalar registers. In 
vector program, S holds the scalar values or 
constants which participate in vector operations 
with vectors. 

4. Lookahead-Fetch-Vector-Buffers. There are 
four of them. Each has 16 64-bit elements. They 
are used to temporarily hold vectors which are 
fetched in advance from main memory. 

Post-Store-Vector-Buffers. There are two of 
them. Each has 16 64-bit elements, They are used 
to temporarily hold result-vectors which are 
produced by ALU and waiting for being stored 
into main memory. 

Just as FR, lookahead and post vector buffers 
may also be used as scalar buffers. 

Lookahead and post buffers are generally 
allocated automatically by hardware. 

5. --Operation—Control-Bit-Vectors. There are 
eight of thems: Olo~ Oy. Every oly has 16 elements 
of 1 bit each. They are used to record the sta- 
tus or result-characteristic of 16 element-oper- 
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ations of vectors. They are used to control 16 
element-operations of vectors. . 

Ol --Operation-Control-Bit-Scalars. There are 
eight of themso,~o,. Every Qj has only 1 bit. 
They are used to record the status or result- 
-characteristic of scalar operation. They are 
used to control operations. 

oO andol are very useful for raising the fle- 
xibility of vector processing and for enlarging 
the range of parallel computation. Because of 
them, the operations of global dependency "con- 
ditional branch according to the results of ALU 
operation", which are very harmful to pipeline 
processing, may be appreciably reduced. 

6, MM--Main Memory. It consists of 16 magne- 
tic core memory modules. Memory addresses are 
assigned mod 16. Every module has capacity of 
32K 64-bit words. Access cycle time is less 
than 16 clock periods. | 

7. @--Indirect-Address-Control-Vector-Regis- 
ters, There are four of thems#,~@, . Every By 
has 16 elements of 22 bits each. 

8. *Y’--Compress and Spread—Control-Bit-Vectors,. 
There are two of thems", 7 . Every 77¥i has 16 
elements of 1 bit each. | 

9. be-Index Memory. It has capacity of 32 
22-bit words: be~b,;. 

A--Address Increments. There are twelve of 
themsA,~Ay- Every Aj has 22 bits. They are used 
to store address increment (or skip distance) 
between two adjacent elements of a vector in MM. 

1--Length-Store-Eelments. There are twenty 
of thems 11,,. Every 1, has 22 bits. 1 are used 
to store vector length (the number of elements 
of a vector), loop count number, etc. . 1 and A 
are the different parts of the same semiconduc-— 
tor memory unit of 32 words. | 


Basic word size: 64 bits. 

Main Memory capacity: 512K 64-bit words 

MM Vector access maximum rate: 1 word every 
clock tick, 

ALU speed (clock cycle number needed for. pro- 
ducing each computation result): 

floating point add: 1 cycle (dependent opera~ 
tion: 4 cycles) 

floating point multiply: 2 cycles (dependent 
operation: 5 cycles) 

floating point divide: 8 cycles (dependent 
operation; 12 cycles) 

most of the other operation: 1 cycle 


Main features (which, except 1,2 and 8, are 
not available in CRAY-1 [4]): 

1. Multi-Vector-—Registers, 

2. Vectors are treated in vertical-horizontal 
processing fashion (i.e. segment by segment 
fashion). | 

3. When vectors are being treated in segment 
by segment fashion, dividing vectors of arbi- 
trary length N into segments of 16 (or less) 
elements each is automatically done by hardware. 

There is a Vector-Segment-Length-Register 1', 
it controls the execution of vector instruction. 
Every vector instruction executes the operations 
of 1' elements (1'<16). Besides, we have two 
special control-type instructions; vector loop 
starting instruction " & " and vector loop 

| al; 


ending instruction " i." Loop body, which is 
i | 

7 Ww " AT ] it ; 

between (fei; and ie » Will be executed 


[N/1é] times. The loop ends as soon as N 
elements of vector have been processed. 
The function of instruction " (Faas: is: 


i 
(1)Noly3 (2)min(1;,16)91'; (3)1;-1'=1; 
The function of instruction " Tt." iss: 
if 1;=0, then loop ends; . 


if 1480, then doz(1)min(1;,16)al'; (2)1,-1'sh; 


(3)jump back to the beginning of the loop body 
for continuing execution. 

For example, program iss 

Boa) Feat TE, 
It will be processed in the 4 ( [50/1é]=4 ) 
passes through the loop as follows: 
the first pass,compute F( Upetss Vontes °° * )afonts 
the second pass,compute F( ug 515 Yig-ata °° ) hema 
the third pass,compute F\ Uga. iy Vrngqs °° * ) fg, a7 
the fourth pass,compute Plugs sq Vegnugs °° ° ) Mig ge 
where uj~j represents i-th element to j-th 
element of vector W. 

4. In this system A-type vector, as the 
basic object of vector operation, can have an 
arbitrary beginning address D and an arbitrary 
address increment A. That is, we can have 
instruction WeR.»sR, or {OR} 30, and W 
-type vector, its address vector is (D,D+A, D+2A, 
D+3A,°°**), both D and & may be arbitrary. 

When the row vector and the column vector of 
a matrix are frequently used in the alternative 
way, there is no need to transpose the matrix 
many times, we only need to use the different 
A-values, we can directly process the two kinds 
ofdirective vectors. 

Multi-dimention array has more directive vec-— 
tors. They also can be directly processed... 

_ As memory space allocation is concerned, it 

is best to select odd A for the most frequently 
and the second most frequently used directive 
vectors so as to attain the maximum access rate 
for MM. — 

5. Infirect vector @u (i.e. =(#]) is also a 
basic object of vector operation. Every instru- 
ction can directly use Bu as its operand, or 
store computation result into Au. That is, we 
can have instruction #ueR;4 Ry or RiOR}> pu, 
where, fis an Indirect-Address-—Control-Vector- 
-Register, UW is a common MM vector, if P=(io, 
iy sig,°**), then AW represents a MM vector 

Mis sis sig yee) 

For example: If in computational expression 
F(w,V,°*:)5f, we do not need to compute all 
the elements, we only need to compute a few 
elements which satisfy a particular condition. 
Let B be a vector consisting of the subscripts 
of those elements that satisfy the particular 
condition, then we only need to directly com- 
pute the following expression: F( BU, BY, «++ apt 

This feature is helpful for enlarging the 
range of parallel computation. 

6. Every instruction has the function of 
compressing or gpreading. 

Spreading vector 7yu is also a basic object 
of vector operation. Every instruction can use 
Wu as its operand, that is, we can have ins- 
truction WUOR}> Rk. 


isa A- 
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For example: Let Y=(1,0,0,0,1,0,0,0,1,0,0,0, 
1,0,0,0), W=(Up Uy ,Uz,°**). Then as an operand, 
YU=( Up ,0,0,0, uy ,0,0,0,u, »0,0,0,u3,0,0,0). 

Every instruction can compress the computa— 
tion result-vector under the control of 7 and 
store the compressed vector into MM. 

For instance; Let 7 be the same as before, 
and_(Xo5X4,°%+,X45 ) is the computation result 
of RiORj,then instruction Rj@Rj97U is executed 
as follows: X9> u,yX4>uU, 9Xp_ => Uz ,X,.—> Uz 

7. For the three register addresses i,j,k of 
vector instruction RYUOR}> Ry, no limit is set 
to them. i,j can differ from k, and i,j can al- 
so be equal to k. . 

8. In adition to the features of highly pipe- 
lined ALU and interleaved 16 modules MM, the 
operations of the three main units--Instruction 
Unit, MM Unit and AL Unit-—--are asynchronously 
and highly overlapped. Because of overlapping, 
the vector operation start-up time may be 
ignored. In general, there is no time delay | 
between two successive vector-operations in ALU. 
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MAX: AN ALGORITHM FOR FINDING MAXIMUM 
IN AN ARRAY PROCESSOR WITH A GLOBAL BUS 


Shahid H. Bokhari 
Department of Electrical Engineering | 
University of Engineering & Technology, Lahore-3l 
| Pakistan : 


Summary: 


The problem of finding the maximum of a set 
of values stored one per processor on an array of 
processors is analyzed. The array has a time- 
shared global bus in addition to conventional 
processor~-processor links. It is shown that the 
problem may be solved:on an n X n array in 


o(n2/3) time using a two-phase algorithm that uses 
conventional links during the first phase and the 
global bus during the second phase. Without a 
global bus the problem takes O(n) time. Two types 
of array interconnection patterns are considered: 
the eight nearest-neighbor pattern and the four- 
nearest: neighbor pattern. The analysis is shown 


to apply to both cases. Extensions to q-dimension- 


q/qtl 


al arrays result in an O(n ) algorithm. 


Global broadcast buses provide an attractive 


method of increasing connectivity of array process- 


ors. Although past applications of broadcast bus- 


es have mostly been in the area of local computer - 


networks, where they have been used to interconn- 
ect processors that have occasional bursty commun- 
ications [7], improvements in technology now make 
it viable to use them to interconnect a number of 
processors that are cooperating in the parallel 
solution of one problem. In particular, a global 
bus may be combined with conventional processor- 
processor links in an array of processors to per- 
mit non-adjacent processors to communicate without 
passing messages through intermediate processors. 
This type of structure blurs the traditional 
distinction between "tightly coupled" and "loosely 
coupled" computers. 


A practical example of such an arrangement is 
the Finite Element Machine (FEM), [1] being devel- 
oped at NASA Langley Research Center. The proto- 


type array is made up of 36 microcomputers arrang- 


ed in a 6 X 6 grid. It is proposed to ultimately 
construct a1024 processor array. Each processor 
is connected to its eight '"nearest-neighbors" 
through direct processor-processor links. In add- 
ition there is a time-shared global bus to which 
all processors are connected. 


The direct links allow pairs of adjacent pro- 
cessors to communicate with each other and also 
allow each processor to transmit a value to its 
eight neighbors. These operations can be carried 


This research was supported in part by NASA 
Contract NAS1-14472 while the author was resident 
at the Institute for Computer Applications in 
Science & Engineering (ICASE), NASA Langley 
Research Center, Hampton, Virginia 23665, USA. . 
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out in parallel. In contrast, the global bus 
permits any two non-adjacent processors to comm- 
unicate and for any one processor to broadcast — 
its value to all other processors. 


Since the bus is time-shared, only one pro- 
cessor may transmit a value on it at one time. 
Thus, although the bus improves connectivity, 
this is “time-shared" connectivity. 


Efficient algorithms for array processors 
must take into account the two types of. connect- 
ivity available. Challenging problems arise when 
developing such algorithms; consider for example 
the Mapping Problem [5] in which the objective is 
to minimize the usage of the time-shared global 
bus. 


This paper discusses the prblem of finding 
the maximum of a set of values stored one per 
processor [2]. At the outset each processor Pp 
has a value V(p) and the problem is to find the 
maximum of V(p) over all processors and to trans- 
mit this maximum value to every processor. This 
problem can be solved for an n X n array in O(n) 
time using a simple algorithm that uses only the | 
direct links. If only the global bus is used the 


time required is O(n’). We have developed a two- 
phase algorithm that first uses the direct links 
for a certain number of steps and then switches 
over to the global bus. The overall time required 


is o(n2/3) which is below the time required by 
either of the constituent algorithms. This seem- 
ingly paradoxical result is due to the optimal 
selection of the switchover point. This algorithm 
is a refinement of an algorithm proposed in [2]. 


Two types of arrays are analyzed in this 
paper. The eight-nearest neighbor array is the 
FEM discussed above. The four-nearest neighbor 
array has an interconnection pattern similar to 
that of Illiac-IV. | 


The purely local algorithm utilizes only 


direct links and proceeds as follows. Each pro- 
cessor local broadcasts its own value to its ‘'a' 
neighbors (a=8 for the FEM, 4 for the Illiac). 
This causes the broadcasting processor's value to 
be placed in the input registers of all its neigh- 
bors [3], [4]. Each processor then reads all of 
its input registers, updating its own value if the 
received value is greater. The maximum value 
spreads throughout the array in time proportional 
to the diameter of the array, i.e. O(n).. 


The purely global algorithm utilizes only the 
time-shared global bus. At the start of this al-' 


gorithm each processor attempts to acquire the 


bus. The bus is granted to an arbitrary process- 
or which uses it to broadcast its own value toall 


a processors in the array. At subsequent steps 
only those processors that have not yet had a 
chance to do a global broadcast attempt to acquire 
the bus. In the worst case this algorithm takes 


2 ; Sots 
O(n") time since the processor containing the max- 
imum value may not get the bus until the very end. 


We assume that a global broadcast takes t 
units of time while a local broadcast takes a-t 
units of time (where a is the number of neigh- 
bors). This closely approximates the behavior of 
the FEM's hardware [1]- [4]. 


The two-phase algorithm that we have develop- 
ed first utilizes the purely local'algorithm for 
k steps and then switches over to the global algo- 
rithm. Processors that have never had their val- 
ues updated during the local phase are called 


survivors. During the global phase only survivors 
contend for the bus. If we denote by Sh the max- 


imum number of survivors possible in ann Xn 
array after k steps of the purely local algorithn, 
then the time required by the two-phase algorithm 


is T= k-a-et + Sit. 


The program that executes in each processor 
is given below. 


begin 
survivor :=true;3 
(* begin local phase *) 
for j:=l to k do 
begin 
local_broadcast (own_value) ; 
for neighbor:=1 to a do 
begin 
local_ receive (neighbor,n value) ; 
if own_value€n value then 
begin 
own_value:=n_value; 
survivor :=false; 
end 
else (* do nothing *) 
end 3 
end; 
(* end local phase *) 


(* begin global phase *) 
while survivor do 
begin : : 
attempt _global_ broadcast (own_value); 
if successful then survivor:=false 
else 
begin 
global. receive (n_value) ; 
if own value€n_value then 
- begin 
own _value:=n_ value; 
survivor:=false; 
end 
else (* do nothing *) 
end; 
end; 
(* end global phase *) 
end. 
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The function local_broadcast(x) broadcasts 
the value x to all neighbor's input registers. 
local _receive(x,y) reads the input register from 
processor x into the variable y. Attempt_global | 
broadcast(x) will initiate an attempt to acquire 
the bus and broadcast value x. If this attempt is 
successful, the boolean variable "successful" will 
be set true. global receive(x) causes the cont- 
ents of the global input register to be moved to 
Xe 


Expressions for S, for both types of arrays 
have been obtained in ~ [6] . The optimal value 
of k can be derived using these expressions to 
obtain the minimum time. to find maximum using the 


two-phase algorithm. These are 

Ta = (6n2/3_4)¢ for the Illiac type array and 
= 2/3 

Tin = (12(n/2) -8)t for the FEM. 


For n processors connected in a ring (plus a 
global bus) the two phase algorithm finds max- 


1 
. “5 e e se 
imum in O(n’) time. For a q-dimensional array 


the time is o(n/ 4d, 


. These cases are discussed 
further in [6]. , 
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A PRACTICAL PARALLEL ALGORITHM FOR REPORTING INTERSECTIONS OF RECTANGLES 


Anita L. Chow | 
Communication Products Technology Center 


GTE Laboratories, 


Incorporated 


Waltham, Massachusetts 02154 


summary 

Given a set of N rectangles with their sides 
parallel to the coordinates axes, we are asked to 
report all pairs of rectangles which intersect. 
This problem has important applications in VLSI 
circuitry design rule checking [1,4]. There 
exists O(N log N + k) time algorithms [2] for 
reporting all k intersecting pairs on a unipro- 
cessor machine. However, for large input size, 
these results are not satisfactory. This concern 
has motivated our investigation of the rectangle 
problem using parallel computing machines to 
produce a faster algorithm. 


Two models of computation are used in this 
paper. They are the shared memory model (SMM) [5] 
and the cube-connected-cycles model (ccc) [6] 
which can emulate a cube model. The validity of 
the SMM resides in uncovering the inherent data- 
dependence of a problem while the validity of the 
CcC, which complies with the VLSI technological 
constraints, is the development of practical 
algorithms. 


We can say that two rectangles intersect if 
their edges intersect or if one rectangle en- 
closes the other entirely. Thus, we can solve 
the problem in two intermediate steps: (1) re- 
porting the intersections of horizontal and 
vertical line segments, and (2) two-dimensional 
range searching. 


First, we study the intersection problem of 
a set V of n vertical line segments and a set H 
of m horizontal line segments. Let T(v) and B(v) 
denote the y-coordinates of the top and bottom 
endpoints of a vertical line segment v in set V. 
We assume, for simplicity, the following: all 
T(v) and B(v) are distinct integers in the inter- 
val [0, N- 1], where N = 2n is the number of 
distinct y-coordinates of the endpoints, and N is 
a power of 2. For the general case, the readers 
are referred to [3]. 


A binary search tree Y of height log N can 
be produced for the set V. At each level in J, 
the nodes are indexed from left to right, start- 
ing with the integer 0. 
i, represents an interval [j-2', G + 1) 2'] 
and contains a list of edges v sorted in the 
positive x-direction, where B(v) $ j°2' and 
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The node N,(j), at height 
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(j + 1) 2' S$ T(v), and v does not belong to any 
ancestor of Nj(j) in &Y. This data structure is 
suitable for implementation on the SMM but not on 
the cCC. Thus, we have to transform JY into a 
data structure Y° which can be implemented on 
the ccc. Y&£° is a binary tree similar to 7, 
except with respect to the indexing of nodes. 
The left-to-right sequence of node indexing at 
any level of Y° is the bit-reversal permutation 
of the node indices at the corresponding level 
of FZ. Note that node Nj(j) in FY” contains the 
same list of vertical line segments that are in 
N,(2 (j mod 21) + ,3/2''|) in FZ. JF” can be 
represented as a collection & of arrays ElogN, 
E log N-1, --- Eo, where Ej; is the concatenation 
of the sorted lists of vertical lines associated 
with node Nj(j), in the ascending order of j. 
Associate with a vertical line v in E; a node 
number which is the node index j such that v 
belongs to N;j(j). Therefore, E,is a selected 
list of vertical line segments sorted lexi- 
cographically by their nodes numbers and their 
x-coordinates. 


ElogN, ---, Eg are determined one at a 
time, in the given order. Let C, be a list of 
candidates for E,, sorted by their potential node 
numbers and then their x-coordinates. Initially, 
potential node numbers of all segments are 0, and 
Ciog nN is the set V sorted by x-coordinates. 
From Ciog n, We extract segments which cover the 
vertical interval [0, N] to form Ejiopgn- In 
Ciog N-1» Segment v has the potential node number 
0, if 0 $ Biv) $N/2, and the potential node 
number 1 if N/2 $ T(v) $N. Note that a segment 
may have both numbers. We extract from the 
remaining segments in Ciogy-, the list C' of 
segments with potential node number 0 and the 
list C" of those with number 1. The concatena- 
tion of C' and C" forms C jog n-1- We repeat this 
process for constructing successively E hog N-1 ; 
gx By: 

To find intersecting pairs, we envision & as 
a binary search tree. At level i we associate 
with each horizontal line segment h a node number 
#(h) indicating that h may intersect some verti- 
cal segment in node N; (#(h)). It is obvious 
that at the Ejogn (the root), #(h) = 0 for 
all h. We maintain the set of horizontal seg- 
ments sorted by their node numbers and then 
x-coordinates of their left-most endpoints, the 
same manner as in E;. We. can use a one-dimen- 
sional range searching algorithm [3] to report 
all intersecting pairs at a level in &. We then 


fa) Extracting a selected subset of elements 
from an ordered array means. moving the sub- 
set to consecutive processors in an order 
preserving fashion. 


determine which node number in the next level 
should be associated with each horizontal seg- 
ment. We continue this process which geomet- 
rically traces a unique path, possibly two, from 
the root to a leaf. 


of processors. The formal descriptions and 
analysis of these algorithms are contained 
in [3]. 

In [3], it is shown that this method can be 
generalized when we have N'*"% number of proces- 
sors, 0 < aS 1, to improve the time complexity 
The time complexity of this algorithm for re- by a factor of a log N. 


porting all intersecting pairs of n vertical and m 


horizontal line segments is 0((log (n + m))? + k') 
with 4n + 2m processors, where k' is the maximum 
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The construction of Fis similar to @: 
the set S is first sorted by their x-coordinates. 
The resulting array is P,,,y. We then determine 
the node numbers for the next level and rearrange 
the order of points according to their node 
numbers . 


To answer the set Q of queries, we use # as 
a binary search tree. Initially, we sort Q by 
L(q). We extract, from Q, those q such that 
B(q) £$ 0 and N- 1 $ T(q). For those extracted 
queries q, a point s with L(q) S$ X(s) S$ R(q), 
must satisfy q. Therefore, we can use one- 
dimensional range searching to find all the 
points in these extracted queries. For the 
remaining queries, we determine their node number 
in the similar manner as we determined those for 
the horizontal line segments. We proceed until 
all queries are answered. 


The time complexity of two-dimensional range 
searching algorithm is 0((log (n + m))* + k) with 
n + 4m processors, where k is the maximum number 
of inclusions per query. 


The line segment intersection algorithm. and 
the two-dimensional range. searching algorithm are 
combined to give an 0((log N)* + k') algorithm | 
for reporting all k intersecting pairs of N rect- 
angles with N processors, where k' (§ k) is the 
maximum number of intersections per rectangle. 
With the best known serial algorithm requiring 
O(N log N +k) time, this algorithm yields a 
speedup of O(N/log N) and an efficiency of 
0(1/log N), which means that the algorithm is not 
only fast, but involves relatively little waste 
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Abstract -- A possible design alternative for 
improving the performance of a multiprocessor 
system is to insert a private cache between each 
processor and the shared memory. The caches act 
as high-speed buffers by reducing the memory ac- 
cess time, and they affect the delays caused by 
memory conflicts. In this paper, we study the 
effectiveness of caches in a multiprocessor sys- 
tem. The shared memory is pipelined and _ inter- 
leaved to improve the block transfer rate, and it 
assumes an L-M organization, previously studied 
under random word access. An approximate model 
is developed to estimate the processor utiliza- 
tion and the speedup improvement provided by the 
caches. These two parameters are essential to a 
cost-effective design. An example of a design is 
treated to illustrate the usefulness of this in- 
vestigation. 

1. Introduction 

In this paper, we present simulation results 
and an approximate analytical model to evaluate 
the performance of cache-based multiprocessors 
with a shared memory (SM) as depicted in Figure 
1. At the first level, each processor has a 
private cache (PC). The second memory Level 
comprises the L-M memory organization, which con- 


sists of lt lines and m memory modules per Line 
CBRI 77]. A Line is used to denote the address 
bus within the SM. Associated with each Line is 


a Direct Memory Access (DMA) controller which re- 
ceives a cache request for a block transfer of 
size b and issues b internal requests (CIR) to 
consecutive modules on the Line. In this paper, 
it is assumed that the interconnection network 
between the private caches and shared memory 
modules is a full crossbar. 

In the architecture of Figure 1, there is a 
data coherence problem in which several copies of 
the same block may exist in different caches at 
any given time. When a processor attempts to 
write in a cache, all the copies in other’ caches 
must be updated before the process is allowed to 
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Figure 1. Cache-Based Multiprocessor System 
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proceed. Various stolutions to the cache  coher- 
ence problem have been proposed [CEN 78, DUB 
81aJ. In summary, the cache coherence problem is 
solvable, and its impact on system performance 
can be minimized by efficient data-sharing 
mechanisms. In our study, the effect of the 
overhead caused by the enforcement of cache 
coherence is neglected. The program behavior in 
a processor will be characterized by its cache 
hit ratio, h, or miss ratio, 1-h. The determina- 
tion of the hit ratio of a program as a_ function 
of cache size, set size and block size have been 
investigated by several authors CSTR 76, RAO 78). 


Most studies to date evaluate the shared 
memory conflict problem for random word access 
{BRI 77]. We propose a model which is used to 


evaluate the degree of memory conflicts in a mul- 
tiprocessor system with private caches. The 
model permits us to determine the processor util- 
jzation. Furthermore, we compare the effect of — 
the cache on the speedup of the multiprocessing 
system. The performance of the system is a func- 
tion of the cache miss ratio, cache organization, 
processor characteristics and the shared memory 
characteristics and configuration. 


2. The Cache Model 
The processor system consists of p identical 
and synchronized processors. In each of these 
processors, we assume that a machine cycle con- 


sists of an integer number, d, of cache cycles. 
An instruction cycle usually consists of an in- 
teger number of machine cycles. Typical machine 
cycles are instruction fetch, operand fetch and 
cycle, which may involve register- 
register or memory-register references. It is 
obvious that in some machine cycles of a proces- 
sor, no cache memory references will occur. 
Therefore, let © be the probability that a memory 
request is issued by a processor to the cache in 
a machine cycle. Thus, the fraction of refer- 
ences made by the “processor to the cache in each 
cache cycle is x = 6/d. 

When the data requested by a processor is. not 
in its private cache, a miss occurs that causes 
the cache controller to issue a shared memory re- 
quest for a block transfer. In particular, if a 
read miss occurs, the block of shared memory 
words containing the location specified is 
transferred into the cache. We assume that no 
read-through strategy is implemented. If the 
cache is full, a cache replacement algorithm is 
invoked to decide on which block frame to free in 


order to create space for the new block contain- 
ing the referenced data. 
Cache management algorithms differ Gaetesty) 


in the method of resolving write misses. Ina 
write-through strategy, a processor always writes 
directly in the shared memory, and possibly in 
the cache if the block is present. Consequently, 
a block is never copied to shared memory when a 
block frame is freed. However, such a policy re- 
quires buffering of the write requests. Moreo- 
ver, the most efficient schemes to enforce cache 
consistency CCEN 78) are based on a 
write-back-write-allocate strategy. This policy 
is adopted in this paper. For a write hit, the 
data is written only in the cache. However, if a 
write miss occurs, the write-allocate policy is 
used to transfer the block containing the ad- 
dressed word to the cache. Hence in our model, a 


read or write miss requires a block transfer to 
the cache. 
If a cache block frame which has not’ been 


modified is to be replaced, it is overwritten 
with the new block of data. However, a modified 
block-frame that is to be replaced must be writ- 
ten to the shared memory (SM) before a block-read 
from the SM is initiated. In this case, two con- 
secutive transfers are made between the cache and 
SM. We denote the probability of a cache hit by 
h and assume that each time a cache miss occurs, 
a block-write to SM is required with a probabili- 
ty w, followed by a block-read from SM. 


Two methods of organizing the cache for block 
reads and writes are investigated. In one case, 
it is assumed that the two consecutive’ block 


transfers (one block-write followed by one 
block-read) are made between a processor and the 
same SM Line. This assumption will be satisfied 
if a set-associative cache is used in which all 
the blocks that map to the same set are stored on 
the same SM Line. Hence, in this method a_ cache 
miss requires the transfer of a 2b-word block 
with a probability w and the transfer of a b-word 
block with a probability 1-w. 

A second method of organizing the cache as-~ 
sumes that the two consecutive block-write and 
block-read requests are considered independent 
and hence have equal probability of referencing 
any SM Line (assuming independent reference 
model). This assumption may be valid in a fully 
associative cacne. The effect of making two con- 
secutive and independent block requests from a 
processor is to increase the effective rate of 
requests to the SM. 

In our models, the hit ratio is given. Gen- 
erally, the hit ratio depends on the locality 
property of the program mix, the cache replace- 
ment policy, and the block, set and cache sizes. 
Various studies have addressed this relationship. 
In CLEH 80], a program is characterized by a sim- 
plified mathematical model based on its instruc- 


tion mix and the model is used to estimate the 
hit ratio. Smith compares different cache re- 
placement policies (CSMI 78]. Under the assump- 


tion of a linear paging model, he shows that the 
ratio of the miss ratios between the set associa- 
tive and the fully associative caches is 


RCi,N) = 1s for 1> 3, 
where i is the set size (number of blocks in a 
set) and WN is the number of sets. This ratio is 
always > 1. It tends to 1 when Ni, the cache 


capacity, increases without Limit. This rela- 
tionship should be considered when comparing the 
set associative and fully associative caches from 


our models. Finally, the hit ratio is also a 
function of the cache and block sizes. Strecker 
presents empirical results for the PDP-11 family 


computers CSTR /76]. 
model is proposed. 


In CRAO 78], an analytical 
For a given cache size, the 
hit ratio improves’ as the block size increases 
from 1, due to the locality of the references’ to 
the cache. However, beyond a certain block size, 
the hit ratio decreases. This is due to the de- 
crease in the usefulness of the extra words ina 
block as the block size increases. For a given 
block size, the hit ratio increases monotonically 
with the cache size. If the hit ratio can be 
determined empirically on a uniprocessor machine 
CSTR 76) or theoretically CRAO 78, LEH 80), the 
models given in this paper can then be applied to 
evaluate the various performance’ indices. We 
will assume that the cache size is adapted to ob- 
tain a given hit ratio. 

For practical purposes, the absolute size of 
the private cache would be expected to be Large 
enough to accommodate at least the "working set" 
of the process so that the miss ratio, 1-h, is 
small Cin the order of 0.1). Furthermore, we as- 
sume that the block transfer time is also small 
(less than 64 cache cycles). Under these condi- 
tions, it is not necessary to perform a task 
Switch on a cache miss to another  runable_ pro- 
cess. Therefore, in this paper we shall assume 
that, on a cache miss, the processor enters a 
wait state while waiting for service of the 
desired block request, and into a sequence of 
transfer states while the block is being ser- 
viced. If a processor is not in the wait or in 
the set of transfer states, it is said to be in 
the active state. Hence, the processor utiliza- 
tion can be computed from the fraction of time 
spent in the active state. Finally, associated 
with each DMA controller of a memory Line is a 
buffer which queues the requests’ for block 
transfers. The DMA controller schedules these 
requests to the Line, using a First-Come-First- 


Served CFCFS) policy. 
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3. The Shared Memory Organization 

The shared memory configuration is derived 
from the L-M organization, which exploited the 
timing characteristics exhibited by semiconductor 
memories with address Latches [BRI 77]. The ad- 


dress cycle or hold time, age which is the 


minimum duration that the address is maintained 
on the address bus of the shared memory module 
for a successful memory operation, is usually 
less than the shared memory' cycle, Cy: 


Throughout this paper, we assume that the basic 
unit of time is the cache cycle, whicn is equal 
to t seconds. If the address and shared memory 
cycles are quantized so that they are expressed 
as an integer number of cache cycles, then 


a= Pay/t) and c = Pcp/t | 5 


so that a set of modules can be multiplexed on a 
Line. In general, 1<a<c. When a memory 


_ operation is initiated in a module, it causes the 


associated Line to be active for a units of time 
and the module to be active for c units of time. 
The shared memory, which consists of N = 2” in- 
terleaved identical memory modules, is organized 
in a matrix form in order to exploit the memory 
module characteristics (a,c). As shown in Figure 
1, a particular memory configuration (l,m) con- 
sists of l= 2 Lines and m= 2°" Lines and 
modules per Line, such that lm = N, for integer 
B > 0. The blocks in the memory are interleaved 
on the lines so that block i is assigned to 
modules on Line i mod Ll. It should be noted that 
this does not contradict the assumption made ear- 


Lier that blocks of the same set are on the same 
Line for the set associative cache model. 
Since the shared memory is used in the block 


transfer mode in_ this paper, we will assume an 
address cycle of a = 1 for the shared memory, in 
order to effectively utilize the Line. However, 


if a> 1 for a particular type of memory, the ad- 
dress cycle could be made equal to 1 by incor- 
porating an appropriate address latch in each SM 
module. Since a= 1, the memory module will be 
characterized by c in the rest of the paper. The 
model developed in LCBRI 77] is not applicable 
here, since it was for single-word transfers that 
are requested by pipelined processors. In order 
to effectively utilize the SM modules for block 
transfers, the modules on a Line are interleaved 
in a particular fashion, so that the servicing of 
two SM requests could be overlapped on the same 
Line. The SM modules on a line are interleaved 
so that a block of data of size b = 2 is inter- 
Leaved on consecutive modules on that Line. Let 
Line i and module j on that Line be referred to 
as L. and Ms respectively for O< i< Ll-1 and 


O<j<m-1. Then, the k-th word of the block of 
data that exists on Line i is in module k mod mM 
on that Line for 0 < k < b-1. It is important to 
note that the first word of a block that exists 
on line i is in the first module, M. 0” of that 
ao 
Line. In this paper, we assume that b> m. If b 
< om, memory modules M. bi, peqreeeeMe m1 will 
not be utilized, since a block starts in module 
M. ne 
1,0 
When 
i, the 
cessive 
modules 


an SM block request is accepted by a Line 
DMA controller at that Line issues b suc- 
internal requests (CIR) to consecutive 
on Line i, starting from module M. O° It 
are is- 

There- 
word of 


is assumed that these internal requests 

sued at the beginning of every time unit. 

fore, the internal request for the k-th 

the block will be issued to module M. j’ where j 
r 


= k mod m for 0 < k < b-1. It is obvious” that 
his set of b internal requests is not preempti- 
ble. Note that if b > mor if the cache is set 
associative, the (mt1)st internal request is for 
module M. 0° Consequently, the first request 


must a completed by the time the (mt1)st inter- 
nal request issued. This constraint is satisfied 
if c<m. 


4. Performance Analysis 


In this section we present assumptions and 
develop the models that permit us to evaluate the 
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as 


> 1. 


‘terval that 


b-mtc time units. 
service 
request is btc~1. 


various performance indicators of the cache-based 
multiprocessor system. 


4.1 Hybrid Simulation 


For analytical purposes, it is assumed that 
cache requests to SM are random and uniformly 
distriouted over all L Lines of the SM. This as- 
sumption is justified by the interleaving of the 
blocks across the Lines. One inference that can 
be. made directly from the above assumption is 
that the probability of a request addressing any 
module is 1/2. | : | 

In order to understand 
tics of the servicing 


the timing characteris- 
of requests for block 


transfer, we define the time instants t = and * 
Lim (t-At) and Lim (ttAt), respectively, for 
At+0 At+0 

At > 0. A time unit <t,t+1> may be thought of as 


beginning at time t* and ending at time (t+1). 
Hence, since a = 1, the successive Internal Re- 
quests which are generated to a Line in the ser- 
vicing of an SM request, do not encounter = any 
conflicts. 

Recall that when an SM request for a_ block 
transfer is accepted, the DMA controller issues b 
successive IRs. If the request is accepted on 
Line i at time t, then the IR for the k-th word 
of a block of size b is initiated at time ttk to 
moduLe Mey for j =k mod m and O < k < b-1. 


Since the SM module cycle time is c, module M. 


. 1,j 
will be busy in the intervals <ttk, t+ct+k> for 
the values of j. 


Since b = 2° > m= ge then Dis an integer 
Therefore, each module on a Line i which 
accepts an SM request for block transfer at time 


: b. 
t receives — internal memory requests. In par- . 


ticular, the last IR to module M. 0 is made at 
4 


time t+ @ -1) m= t+ b-m. Thus, the last in- 


module M. is (during the 
1,0 


current block transfer) is <ttb-m, t+t+b-mtc>. 
After this period, a new block transfer which ad- 
dresses Line 1 can be accepted. Since the 
current block transfer was initiated at time t, 
all block transfer requests arriving’ at 
tt1,t+2,...,ttb-mtc-1 will find line ij busy. 
Note 


busy 


We refer to this as_ the 
The actual service time of the SM 
This is the time taken to 
cess and transfer a block of size b when the re- 
quest is accepted. Since we do not implement a 
read-through policy in the cache model, the pro- 
cessor goes through a sequence of transfer states 
having total duration btc-1 before returning to 
the active state. That is, the block transfer 
must be completed before the processor can become 
active again. 

The cache-based multiprocessor system may be 
modeled as a closed queueing network shown in 
Figure 2. This network has been called the ‘cen- 
tral server model" CKLE 76]. The servers are the 
shared memory Lines and the requests are issued 


by a set of p processing nodes, each of which 


time. 


b-mtc m1 


CO THEE 


Central server model for 
the cache system. 


Figure 2. 


Lumps a processor with its local cache. The two 
segments of a server model each SM Line and re- 
flect the pipeline effect of the LM memory 
described above. 

The behavior of each processor is illustrated 
in Figure 3 for both cache strategies. Node "A" 
denotes an active state of the processor and node 
"W,'' a waiting state. Node "LT" represents the 
state for the first part of a transfer during 
which the Line is kept busy (line service time), 
and "ET,"' the state in which a transfer iS  com- 
pleted without holding a Line. These states have 
to be distinguished because of their different 
properties. Note that the state representations 
and their interconnections as shown in Figure 3 
do not constitute Markov graphs, since each state 
has a different average duration. These average 
durations are indicated on the graphs. The state 
of each processor changes asynchronously in an SM 
request cycle. The SM request cycle is the total 
average time spent in the active state, wait 
state, and set of transfer states (LT and ET). 
According to the model assumptions, the visit 
time (expressed in units of cache cycles) in 
state A is geometrically distributed with mean 
1/xM-h), and the visit time in a state ET is 
constant with value m-1. 

For the set-associative cache (Figure 3a), if 
a block-write is not required (with probability 
1-w) on a cache miss, then the Line which accepts 
the SM request is busy for b-mtc time units. 
However, if a block-write is required (with pro- 


bability w) in addition to the block-read, then 
two consecutive block transfers Ceach of size b) 
are made uninterruptedly on the same SM Line. In 
this case, the line that accepts the request is 
busy for 2b-mtc time units. 

The case of the fully associative cache is 


(b) 


Fully associative cache 


Figure 3. State representation for each 


cache implementation. 
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simpler (Figure 3b): if a cache miss requires a 
block-write (with a probability w) followed by a 
block-read, the processor submits these requests 
aS two successive and independent requests’ to 
transfer a block of size b in each case. Each of 
the two corresponding LT states thus have a_ con- 
stant duration b-mtc. 

In both cases, each processor goes’ through 
"independent" states (states A and ET), followed 
by “interactive” states (states W and LT). When 
in an independent state, a processor can proceed 
freely and does not interfere with the progress 
of other processors. Interactive states are 
characterized by a potential for conflicts with 
other processors. The interactive states are 
framed in Figure 3. To estimate the average 
visit time in such states, simulations are re- 
quired. Note that the foregoing analysis. that 
leads to the state representation of Figure 3 
simplifies the simulation significantly. Such an 
approach has been called "hybrid simulation" CSCH 
78). Table 1 is a compilation of some of the 


Table 1. Processor utilization for the set asso- 
ciative cache i 
(c=4, w=0.3, m=4, x=0.4, h=0.95, p=16). 


tL ob Simulation Model Error (4) 
1 4 0.595 0.601 +123 
8 0.299 0.300 +0.3 
16 0.150: 0.150 0.0 
32 0.076 0.075 “1.5 
64 0.058 0.058 0.0 
2 4 0.797 0.814 +2.1 
8 0.540 0.578 +7.0 
16 0.281 0.300 +6.8 
32 0.145 0.150 +429 
64 0.073 0.075 +2.7 
4 4 0.839 0.842 +3.6 
8 0.711 0.729 +2.5 
16 0.474 0.514 +8.4 
32 0.257 0.286 +11.2 
64. 0.133 0.147 +10.5 
8 4 0.850 Q.852 +0.2 
8 0.759 0. 764 +0.7 
16 0.596 0.614 +3.0 
32 0.385 0.414 +7.5 
64 Q.215 0.238 +10.6 
16 4 Q.855 0.856 +0.1 
8 0.776 0.777 +0.1 
16 0.644 0.650 +0.9 
32 0.465 0.475 +2.6 
64 0.283. 0.301 +425 


Simulation results for a realistic case (w = 0.3, 


c= 4, x = 0.4, h= 0.95, m=4, p = 16). The 
number of lines, L, and the block size, b, are 
variable. The performance index is the average 
processor utilization, defined as the average 


fraction of time spent by each processor in an 
active state. Both cache implementations have 
practically the same performance, for the same 
value of the hit ratio. —- 

Since these simulations are still expensive, 
despite the simplification, we have developed an 
approximate analytical model to estimate the pro- 
cessor utilization. 


4.2 Approximate Analytical Model 


The processor's behavior shown in Figure 3 are 


quite complex to model exactly. We propose an 
approximate model based on a method applied in 
CHOO 77] for the modeling of random word accesses 
in multiprocessor memories. We number the pro- 
cessors from 1 to p and the memory Line from 1 to 
Ll. Let | 7 | 


T(t) = lpg One] 


for k = 1,e0e,1, 
with i, j(t) = 1 iff processor j is not wait- 
o 


ing for or using Line k, 


and i, iad = 0 iff processor j is waiting for 
4 


or using line k at time t. 


I(t) is called the indicator vector for tine 


k at time t. Each component uF j&? indicates 


‘ ? 
whether or not processor j iS waiting for 


holding Line k. Note that a processor waits 
or holds a Line whenever it is in state W -.or 
Cinteractive states), respectively. Let Y be 
average fraction of time a given processor is in 
an independent state. Y is also the probability 
of being in such a state by the ergodic property 
CKLE 76). The symmetry of the system implies the 
same value of Y for all the processors. Similar- 
ly, let Xe be the probability that a given Line 


is busy and S, the average line service time of a 
request. Then 


or 
for 

LT 
the 


ProbL"'at Least one processor is waiting for 
or holding a given Line k"J 


x. = 
Ss 


1 - ProbC"no processor is waiting for or 
holding tine k"J 


= 71") (1) 


1 - Probl 14 ° Tk 2ttt tk p 


= 1 - ELi i 


k 1 ° Ty aeeetk py 
This Last equality results from the fact that the 
expectation of a random variable taking only the 
values OU and 1 is equal to the probability of the 
variable being 1. The rate of completed requests 
by a Line is 


X_/S (2) 

S 
In equilibrium, this rate can be equated to 
the rate of submitted requests to a line. To 
compute this second member of the equation, we 


have to distinguish between the two cache imple- 
mentations. 


eel 


4.2.1 Set-Associative Cache (Figure 3a). A 
processor submits a request whenever it exists 
state A. This occurs, for each processor, when- 
ever a cycle in the network of Figure 3a is com- 
pleted. Let C be the average time taken by such 
a cycle. From the definition of Y, we have 

_ 1/xM-hd + m1. 
Yoo .. 
The rate of submitted requests to the SM by 


any one processor is 1/C. Since there are p re- 
questing processors and each request is submitted 
randomly to any one of the lL Lines, the average 
rate of submitted requests to a given Line k is 
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iy Bit cee Ss 
C [ c- (3) 


1 
x(t=hy * 071 


Equating (3) and (2), and since for the Set Asso- 


ciative Cache (see Figure 3a), S = b(1tw)-mtc, 
one finds that 
a ce BS te ty 
S 1 re L 
x(1-h) 7 | 
(4) 
a 2s ry, . P,. _DCltw)-mtc 
with o = x€I-h) * 7 * aeGTxd-n * 
Combining equations (1) and (4) we have 
ELiL 7 Tp atte tk pd +pY=1. (5) 


This equation is exact for the set-associative 
cache. However, the first term of the L.H.S. of 
the equation is very complex to estimate in gen- 
eral. The approximation consists in neglecting 
the interactions between processors. As a result 
of the approximation, the components of I, (t) are 


not correlated. This approximation performs best 
for a short and deterministic Line service time. 
Indeed, Large instances of the Line service time 
are more Likely to result in instantaneous Longer 
queues and more interactions between the proces- 
sors. Under the non-correlation conditions, 


EL iy a tk, 2te tk, pt = ELI 4 . ELI, gie--ELiy oJ. 
If we denote by Z the fraction of time spent 
by each processor waiting for or holding a given 
Line k, equation (5) becomes 
(1-2) + py =1, (6) 
because of the symmetry of the system. 

On the other hand, since a processor is either 
in an independent state (A or ET), or in an in- 
teractive state (waiting or holding one of the 
Lines), then by the Law of total probability in a 
system with lL Lines we have 


Y+tle2Ze=t. (7) 
Using (6) with the condition that 1 - p Y > 0, 
z=1- C-py)!/P , 
Consequently, by the substitution for Z in (7) 
and rearranging, we obtain 
ye 1 = cae ; (8) 
p L 


Since Y is the average fraction of time spent 
in a state A or ET, the processor utilization, u, 
which is the fraction of time spent in state A is 


_ 1/x€1-h) _ Y 
(a :ee oh" 


C  ~ Tx@-h) GD * me 
Besides being a good approximation for short 


Line service times with Low coefficient of varia- 
tion, the approximation (8) was proven in {£DUB 
81b] to have the following desirable properties. 
Property 1: when p tends to » (and all other 
parameters are kept constant), Y tends to 1/p. 


Property 2: equation (8) has one unique real. 
solution between 0 and Min (1/p,1). 

As a. consequence of the first property, the 
approxtmation is correct asymptotically, when the 
traffic at the memory Cand thus the interactions 
between processors) increases. This can be seen 
as follows. When the number of processors in- 
creases, the system of Figure 2 tends to saturate 
CKLE 76]. Under saturated conditions, each Line 
is constantly busy, which means that Xe tends to 


1 for all the Lines. 
that Y tends to 1/p. 


4.2.2 fully-Associative Cache. A cycle 
through the network of Figure 3b may result in 
one request (with probability (1-w)) or two re- 
quests (with probability w). If C is the average 
cycle time, the rate of submitted requests by any 
one processor to the memory is (1+w)/C. For the 
case of Figure 3b, 


_ Wxti-hd + (m1) (142) 
ee 

Following the same reasoning as the one Lead- 
ing to equation (8), one finds that equation (8) 
js also applicable to the Fully Associative Cache 
Model with | 


Y 


52 Pp. Co-m+c) (1 +w) : 40) 


1 
xC1-h) + (m-1) (1+w) 
To obtain the processor utilization, we note 
that 


_ 1/x(1-h) _ Y 
Ue = pax (IY 
The Newton iterative method converges rapidly 
with an initial value Yo = 0.5. 


4.2.3 Accuracy of the Approximate Model. To 
check the accuracy of the approximate model, we 
have compared it with the hybrid simulation. 
Some results are shown in Table 1 that are typi- 


cal. The model is adequate for parameter — values: 


corresponding to an effective design, and it is 
able to detect a poor design. in Figure 5, the 
results of the analytical model for the set asso~ 
ciative cache was used to plot the processor 
utilization. The memory cycle time, c, was also 
varied; all the other parameters were kept thé 
same as for Jable 1. The simulation points are 
Linked to their arialytical estimate in Figure 5, 
The analytical model tends to slightly overesti-~ 
mate the utélizatjon. 


4.3 Speedup of the Cache~Based Multiprocessor 
we can derive the speedup of the cache-based 


multiprocessor system (called system 1) over 
another system <calted system 2) without caches. 


Of course, thé two systems have identical parame~ 


ters. That is, each of them consists of ap prox 
cessor system that has a shared memory with L 
Lines and m modules per Line. However, in system 
2, there is ho block transfer, since it is not @ 
cache*based systems, The memory modules are in« 
terleaved for gingle word accesses, as discussed 
in CBRI 773, ach memory reference requires a 
single word transfer. Since we assume the same 


Equation (4) shows’ then - 


memory parameters, the address cycle is a = 1, 
and the memory cycle is c. In this case the pro- 
bability of acceptance of a memory request. for 
system 2 is (see CBRI 78]) 


1-P 


Pp, =——,—_+_——_.,, (12) 
2 r (c-1) (1-P,) 
Te pierce EM RN 
| N 
a L | 
where 1 - P. =f1-1- ae rtp? and N= Len, 


and r'’ is the effective probability of a proces- 
sor making a memory reference in a time unit. 

The instruction mix parameter, 8, is the same 
for both systems. Recall that 6 is the probabil- 
ity of a memory request being issued by a proces- 
sor in a machine cycle. Hence for system 2, the 
fraction of references made by the processor to 
the SM in each time unit is x' = 6/T., where T, 


is the machine cycle time for processors in sys- 
tem 2. Note that x' is the probability that a 
processor of system 2 makes a memory reference at 
any time t, when the processor is in the active 
state. OO | : 

Since we assume that systems 1 and 2 have 
identical processors, the absence of the cache in 
system 2 and the service of each memory reference 
in the SM elongates the machine cycle time of 
each processor from Ty Cin system 1) to T>5 Cin 


system 2). It can be easily seen that 
T, = Ty - 1+c time units. Note that a cache cy- 


cle = 1 time unit. From section 2 of this paper, 
Ty consists of an integer number, d, of cache cy- 


cles. In system 2, a memory reference to SM may 
encounter a delay in service due to memory con- 
flicts. We denote by A the state in which the 
processor is active, and by W the state in which 
the processor is waiting. Furthermore, state B. 


denotes the state of the service of the SM re~ 
guest for i = 1,2,..,c. Figure 4 depicts the 
simple Markov graph for the processor in system 
Cn. It should be pointed out that the scheduling 
of requests for single-word transfers in this 
system is not FCFS as in the system with caches. 
Rejected requests are resubmitted one cache cycle 
later until they are accepted. These resubmitted 
requests are assumed to be independent and hence 
have equal probability of referencing any of the 
Ll liness This schedule, which is also depicted 
by the Markov graph of Figure 4, would slightly 


overestimate the true performance of the system 


without caches, Simutations indicate that the 
overestimation. is within about 5% [BRI 78]. A 
previous model in which the SM block requests are 

fox! . 
xF{PHP, ) 


At g my 
[ , © 


Figure 4, State diagram of a processor's 
transition in system without cache 


not buffered was studied in LBRI 81]. 
fered model presented here is. more accurate but 
not applicable to other interconnection networks. 

Let Qn and qy represent the probabilities of 


‘being in states A and W, respectively. A solu- 
tion to this graph yields 
p 
— Aa | 
A” P, +x'CC1i-P, )+eP, J ” 
| ore An AQ 
i] = 
x a ae 
Wy - Pa TCP, deep, a * nee 
2 2 2 


The processor utilization for system 2 is 


U> = Gye It can be seen that the effective re- 
quest rate is | 
, ae 
P. +x'CCi-P, d+cP, J * 

ora yr 


Again, an iterative technique can be applied to 
obtain a solution to the utilization, U5 = Gy. 


system 1 
This can be obtained from equation (9) or 


t= yt = 
r x"'q, + q (14) 


Let us represent the utilization of 
by U4» 


(11), for the set associative or fully associa- 
tive cache models, respectively. The effective 
machine cycles for a processor of systems 1 and 2 
are Ty/u, and T5/U5¢. respectively. Since 
Ts; = Ty + c-1, and T, = d, the speedup for the p 


processor system is 


The evaluation of the speedup permits us. to 
compare the effectiveness of the cache in the 
multiprocessor system. Certainly, this speedup 
is a function of many parameters. The discussion 
of the results given in the next section exposes 
the effects of the variability of these parame- 
ters on the system benronmante: 


3e Discussion and Conclusion 


In the following discussion, we assume that 
the multiprocessor system consists of p = 16 pro- 
cessors with private caches.. The machine cycle 
time of the cache based system is d = 2 (alk 
times are expressed in units of cache cyctes), 
and the instruction mix parameter, 6, is 0.8. 


The shared memory has an L-M configuration with m. 


= 4, and Ll (a power of 2) is between 1 and 16. 
Thus the total number of modules is variable. 
Other parameters of the study are b, the block 
size, (4 <b < 64), c, the memory cycle time (2 < 
c < 8) and h, the cache hit ratio ¢h = .95). 
Note, however, that for a given cache size, the 
hit ratio and the block size are not independent, 
as observed in CSTR 761. In this study, we as- 
sume that, for a given block size, a cache with 
an appropriate size is selected so that ° the hit 
ratio is kept constant. 

Tne processor utilization (uw) is a sleenwance 


The buf- 
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index reflecting the degree of match between the 
processors and the memory organization. | The 
throughput improvement provided by the introduc- 
tion of caches is measured by the speedup (S>- 


Note that u and S, 


are not. necessarily related: a system with high 
"Sy may have an unacceptably -tow "uu." In gen- 


as defined in section 4.3. 


eral, one desires a design with high processor 
til ization and speedup to justify the investment 
in faster processors and expensive cache 
memories, respectively. For the parameters 
chosen, there is Little difference in the results 
of the set associative and fully associative 
cache models. Hence, only the results for the 
set associative model are given in the figures. 
To Limit the cost, the analytical models are used 
to derivate the following curves. 

A comparison of figures 5 and 6 shows that the 


(b,c)= 

0.9 (4,2) 
oe He — (4,4) 

0.8 aia . « (8,2) 
: y en ans r ae (8,4) 
O.7F 4 (16,2) 


» U 


PROCESSOR UTILIZATION 
N 
N 


1 2 A 8 16 
NUMBER OF LINES IN SHARED MEMORY (log scale) 


Processor utflization for the set- 
associative cache 


Figure 5. 


PROCESSOR UTILIZATION,u 


I 2 k 8 16 


NUMBER OF LINES IN SHARED MEMORY 
£(log scale) 


Figure 6. Processor utilization for multiprocessor 


without caches 


caches can have a dramatic effect on processor 
utilization. In general, an increase in the 
block size causes a significant deterioration of 


the processor utilization. However, for a_ small 
block size (4), the processor utilization for the 
system with cache is much better than for the 
system without cache. This improvement is more 
dramatic when the memory cycle time, c, is large. 
Figure 5 also shows that increasing the number of 
memory Lines, lL, 1s not always cost-effective. 

The following throughput comparisons between 
two systems emphasize the design alternatives of- 
fered by the use of private caches. Both systems 
consist of 16 processors. In system 1, a private 
cache is added to each processor; the memory con- 
figuration is characterized by m= 4 andi <l < 
16. Hence, the cache controllers access the SM 
via a116 by | crossbar switch. In system 2, the 
processors are connected to an L-M memory with lL 
= 16 and m = 4 through a 16 x 16 crossbar switch. 
ALL the other parameters are as described earlier 
in this section. 

Decreasing lk reduces the total number of 
memory modules and hence the cost of the decoder. 
The most significant effect of L in a system is 


the reduction in complexity of the processor- 
memory interconnection network and hence the 
cost. Figure 7? shows the effective speedup 
7 P=16. 
h=0.95 
6 x=0.4 (m,b,c) 
ep == (8,8,8,) 
e5 gat ___ (8,16,8) 
4 ee 
a ee oe sotto (8, 8,6) 
a 4 ee ai ___ (8, 16,6) 
BO ee tally 
7 7 Pi 
3 rf) ge oo. (4,8,4) 
4 zg 7 (4,16,4) 
A/ 7 
eras (4,4, 2) 
5 (4,8,2) 
| (4,16,2) 
ig 
1 2 hk 8 16 
NUMBER OF LINES IN SHARED MEMORY, 1(log scale) 
Figure 7. Effective speedup for cache-based system, 


where system without cache has 2=16. 


achieved by the inclusion of cache memories into 
system 2 and the simultaneous reduction in the 
number of lines Ll. It can be seen that a signi- 
ficant improvement in the system throughput is 
still achievable by the simultaneous reduction in 
L and the inclusion of cache memories. This per- 
formance: improvement is even more pronounced for 
Large values of c. A possible significant reduc- 
tion in l gives the designer a choice. If for a 
small 
of a per processor cache results in a_ speedup, 
Sy > 1, the designer can consider trading off 


Low-cost multiport memories for the expensive 16 
x 16 crossbar switch used in the system without 


number of Lines, L < 16, the incorporation | 
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caches. In fact, as shown in Figure 7, the in- 
corporation of a per processor cache results in 
significant speedup in most cases, even for small 
ks | 
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1. Introduction 


This . paper presents analytic and 
simulaticn results for multiprocessors with 
two-level memory hierarchy of the type shown in 
Fig. 1. The first level of memory is a private 


cache and the second level of the memory is the 


main memory shared by all processors. The two 
levels are connected through a switch. In this 
paper we shall restrict ourselves to switches 
which are full crossbars or delta networks 
[pat79 }. The approximate analytical model 
presented is simple but remarkably close to the 
ex tensive simulation results. 
2. The Physical Model 


2.1 The Cache: For the purposes of this 
paper a cache-main hierarchy will be assumed to 
have an acceptable miss tatio less than 0.1 and a 
block size small enough ‘and/or memories = fast 
enough so that a block transfer takes no greater 
than about 64 cache cycles. An implication of 
the small block transfer time is that in a 
cache-main system it is not profitable to switch 
processes on a cache miss, because a process 
switching time is comparable to a block transfer 
time and alsa because the cache is not large 
enough to hold more than one working set for an 
acceptable miss ratio. Therefore, we can assume 
that of a cache miss, the processor is idle while 
the desired block is being transferred. Thus the 
system throughput of cache-main multiprocessor can 
be computed directly frem the total time spent in 
doing block transfers, if thé processor execution 
is not overlapped with a block transfer. 


232 Cache-Main Interconnection: Two 
interconneotion networks that will be studied here 
are full crossbars and delta networks. Both 
networks will be used here in the cirewit 
switching mode. Once a fault occurs in a cache, 
the fault handling hardware requests a block 
transfer from a particular main memory module and 
the network establishes a path between the cache 
and the main memory module. This path is held 
wntil the memory transaction is complete. The 
path cannot be preempted by any ether Fequests 
éoting from other cache medules, This description 
implicitly assumes that a block resides in & 
single memory module. However, a tremory module 
itself may be interleaved to increase its 
bandwidth. The. advantage of using dircuit 


switching and storing the block in one memoty 


module is the reduction in block transfer time. 
Both in thé crossbar and delta network there is an 
initial delay in astablishing a path due. to 


arbitration, decoding and setting of appropriate 


switches. Once the path is established the data 
oar be transferred at a high rate. 


This research Was Supported in ‘part by Joint Ser- 
‘vices Electronics Program (U.S. Army, U.S, Navy and 
U.S. Air Force) under Contract NO0016-79~C+0242, 
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3. Analysis 


Btaecae Cache Model: = In this 
ecue we develop an & an analytical model for a very 
simple cache organization. This will be extended 
later to include more complex cache organizations. 
The following assumptions define the simple cache 
model. 

1. Each cache fault involves one blotk-write to a 
main memory module followed by.one block-read from 
the same memory module. As a consequence of this, 
we further assume that once a path is established 
between the faulting cache and the requested 
memory module, the transaction (read and write of 
a block) takes a constant time equal to +t CPU 
cycies. 

2. Cache requests to main memory are random and 
uniformly distributed over all main memory 
modules. 


The miss ratio of a program as a function of 
cache size, block size and set size have been 
measured by several researchers [KAP73, STR76, PEUT7T 
YEHS! |. From such data it is possible to 
determine the request rate from a cache to the 
main memory. Let m be the probability that a 
cache makes a request to main memery in a given 
time unit, that is, m is the probability that the 
processor makes a request to the cache and that it 
is a fault. m typically would be less than the 
miss ratio, because not every oy cycle is a 
memory reference. 


To summarize the simple cache model, at each 
time unit. a cache makes a request to the main 
methory with probability m, after some wait time a 
transaction vetween & main memory medule and the 
faulting cache takes place, which lasta for t time 
units. Throughout this peried the ‘proceasor 
renains ide. 


2:2 analysis of simple gache model: A 
processor in our multiprocessor is in one of two 
states. It is either busy doing useful work or it 
ie idle waiting fer a cache-fault service ta be 


completed. The throughput of the system is 
directly proportional to | the . precessor 
utilization. Therefore, we ‘Shall uae = the 


performance, This can be canpated as follows. 
Censider Fig. 2, which shews the effect of 
cache faults and wait ‘times on the processor 
activity. Since each procesaor cycle generates a 
cache fault with probability m, there are on 
avévage wk faulta for kk wits of useful 
computation. Let w be the average wait 
encountered at each request. Siftce a block 
transfer taked t time wiits, the k units of useful 


processor attivity takes k + mk(wtt) time wnita, 
Assuming N prece@aors and 


d M main memory modules 
the following. can be computed directly from Fig. 
2c. The protessot utilization, oo 

U = R/[knak(wtt)] = 1/[1 awe) (4) 


The average number of busy memory modules, 
B = Nmkt/[k+mk(w+t) |= Nmt/[1+m(w+t) | (2) 
In terms of utilization U, B = NmtU (3) 


In the above expressions, the only unknown is 
the average wait time w. It is clear that the 
wait time depends on several factors, such as, 
request rate m, number of processors N, number of 
memory modules M, block transfer time t and the 
type of the interconnection network. Exact Markov 
analysis is always possible for specific numeric 
values of m,t,N and M, because of finite number of 
states. However, the state space is very large 
and therefore computationally very complex. In 
the absence of a reasonable analysis, simulation 
is the only other viable alternative. We have 
done extensive simulation of the cache model. One 
important outcome of the simulation was an 
observation that the processor utilization of 
given sized multiprocessor system with a specific 
network (crossbar or delta) can be approximated as 
a function of the product mt, where m is the 
probability of a cache-to-main request and t is 
the block transfer time. 


Consider the Fig. 2c once again which shows 
the activity of a single processor. While the 
processor is waiting, the cache is resubmitting 
the block transfer request again and again until 
it is accepted by the network; on the average 
this happens for w time units. After the request 
is granted, the network holds a path to a memory 
module for t time units. One can view this as t 
consecutive requests to the same module, each 
request requiring one time unit of service. Thus 
on each cache fault, the network sees an average 
of wtt consecutive requests for unit service time. 
Refering to Fig. 2c, in kt+mk(wtt) time units a 
total of mk(wtt) requests for unit service are 
made to the network. Therefore, the request rate 
(for unit service) from a cache module as seen by 
the network is 

m” = m(wtt)/[+1+m(w+t) | (4) 

In eae of processor utilization U of equation 
(1) we have 


m =1-U | (5) 


The approximation that we intreduce here is 
that wtt consecutive requests to a single memory 
module can be decomposed into wtt separate 
requests which are random, independent and 
uniformly distributed over all memory modules, 
without essentially changing the system behavior. 
The model that we will analyze is a system of N 
sources and M destinations, each source generates 
a request with probability m°” in each time unit. 
The. request is independent, random and wnifcrmly 
distributed over all destinations. Each request 
is for one unit service time. . Rejected requests 
are resubmitted as new independent requests and 
are made part of the new request rate m°. First 
we analyze the system with a crossbar and ther. 
with a delta network. 

The crossbar: We already have the average 
number of busy memory modules B from equation (3), 
namely B = NmtU. We can compute the same quantity 
another way. Hach main memory. module is addressed 
with probability m°/M from a _ cache. The 
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probability that none of the N cache modules make 
a request to a particular main memory module is 
(1-m°/M)”. Therefore on the average M[(1-m7/mM) | 
modules are not doing any memory transfers. In 
other words, the average, number of busy modules is 
=m[1 - (1 - m@/m)"] (6) 
substituting for B = NmtU from Eq. 3 and m°=1-U 


from eq. 5 we have, 
NmtU - omit ~ (4 - (-u)/m)*] = (7) 
The above equation in U can be solved by 
standard numeric algorithms using iterative 
techniques. A good initial value for U is 


obtained by setting wait time w=O in eq. (1), 


that is, setting U=1/(1+mt) which incidently 
corresponds to the maximum possible processor 
utilization. 


The delta network: A delta network is an nr 
stage network constructed from axb crossbar 
switches with a resulting size of a-xb’. Thus _ in 
our model, it is required that N = a” ard M = bd" . 
For a more complete description see [Pat79 |. 
Functionally, delta network is an interconnection 
network which allows any of the N cache modules to 
communicate with any one of the M main memory 
modules. However, unlike in a crossbar, two 
requests may collide in the delta network even if 
the requests were to two different memory modules. 
The average number of busy modules is computed 
recursively using the result of the axb crossbar 
as follows. 

NmtU - a = 0 (8) 
where, Mm. 21 - o, hb)” O<=i<n 
i+1 
and nm. = 1 - U 
4. Discussion of the Results 


In this section we present several results 
obtained using the above approximate analysis. 
The CPU utilization from approximate analysis was 
compared with the utilization obtained in 
simulation for a wide range of parameters. The 
comparison showed that our analysis overestimates 
the processor utilization in most cases. However, 
the error was less than 2% for mt<i and less than 
7% for mt<32. Thus the approximate analysis of 
the multiprecessor cache organization is quite 
accurate in the region where mt<1. As we shall 
see in the following discussion, it is this region 


which is of practical interest. 


Consider Fig. 3 which is a graph of 
processor utilization over a broad range of 
parameters. The utilizaticn plotted may be 


interpreted as simulation results or analytical 
results, since the differences are so small that 
they are not visible on the graph with the scale 
used. Since in the analysis, the processor 
utilization is a function of mt, the parameters m 
and t are not separated in this and other graphs. 
The graph shows three different systems, one is 
N=64, M=64 u ing, 64x64 crossbar, second is N64, 
M=64 using 2 x2° delta network and the third is 
the single processor system N=M=1. Since the wait 


time is zero in the case of M*M=1 system, the |. 


processor utilization from eq. (1) is 1/(1+mt), 

which serves as the upper bound on the processor 
utilization. It is clear from the graph that for 
mt>1 the processor utilization is less than 50%. 


Therefore in a practical system one must have the 
product mt much smaller than 1 for an acceptable 
level of performance. Therefore the region of 
interest is mt<1. As pointed out earlier, it is 
in the region of interest that our approximate 
analysis is most accurate. Figures 4 and 5 show 


the processor utilization in the region of 
interest. Figure 4 shows a graph for a 32x32 
crossbar network and a graph for 2°x2° delta 


network. Figure 5 shows the processor utilization 
as a function of the network size NxN using a 
crossbar. Both figures are obtained from the 
analytical model of the previous section. Other 
measures of performance may also be evaluated from 
the analysis. For example the average traffic 
between the cache and the main and the average 
wait time of a request. . 


From a designer’s perspective the above 
analysis shows that for optimum performance one 
must choose mt as small as possible. Recall that 
m is not the miss-ratio itself but it is directly 
proportional to the miss ratio and t is some 
function of the block size. <A typical curve of 
miss ratio vs. block size for a fixed cache size 
might lock like Fig. 6. From this, one can 
compute the graph cf mt as a function of block 
size, where m is the probability that the CPU 
cycle is a memory reference and that it is a miss, 
and t is the block transfer time. From the graph, 
one can choose .. the optimum block size 
corresponding to the minimum mt. For a given 
cache size, the optimum block size for the maximum 
throughput, may or may not correspond to the 
minimum miss ratio of Fig. 6. To reduce mt below 
this value, one can either decrease the miss ratio 
by choosing a larger cache or reduce the block 
transfer time by using a faster main memory. 
Another alternative is to change the simple cache 


organization so that some of the block transfer 
time can be overlapped with the processor 
execution. This alternative is discussed in the 


next section. 
5. Extensions of the Simple Cache Model 


Three most common extensions of a simple 
cache organization are: (1) Buffered write back, 
(2) Store-through, and (3) Load-through. All of 
these achieve the same objective, namely, overlap 


of the CPU execution with a block transfer. Each 
extension is described below. 
Buffered write back: On a cache miss, the 


block to be replaced, i.e., written back, is first 
stored in a high speed buffer. The desired block 
is then read into the cache module. Following 
this the buffer is written back to the main memory 
module. The writing back of the buffer is 
overlapped with the CPU execution. 


Store-through (write-through): In this cache 
organization, every write command from the CPU 
results in the word being stored in the main 
memory, regardless of whether the corresponding 
block is present in the cache or not. If the 
block is present in cache then the word will also 
be written in the cache. As a consequence, on a 
read miss, the desired block is loaded in the 
cache and the block being replaced is not required 
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to be written back. Here also the writing of each 
word in to the main memory is overlapped with the 
CPU execution. ~ . 


Load-through: On a cache-miss for a read 
reference, the desired word is directly loaded 
into a CPU register from the main memory, after 
which the block containing that word is read into 
the cache module. This strategy tries to overlap 
the CPU execution with a block read. Load-through 
can be combined with either of the two previous 
strategies of write-back and write-through. 


All of these organizations can be analyzed 
with the techniques presented in section 3. In 
each case the approximation used is to treat the 
memory traffic as consisting of independent single 
cycle requests. Furthermore, these requests are 


assumed to be uniformly distributed over all 
memory modules. Once the unit request rate is 
ex pressed in terms of known and wunknowm 


parameters, it can be substituted in eq. (6) or 
(8) to obtain the average number of busy modules, 
which is also expressed another way (similar to 
eqs. 2 and 3) to give a full set of equations for 
the solution of the CPU utilization. 


6. Concluding Remarks 
In this paper we have presented an 


approximate analytical model for multiprocessors 
with private cache memories. The accuracy of the 


model is remarkably good considering the 
complexity of the problem. In the region of 
practical interest the error of the analytical 
model is less than 2%. The same model is useful 
in computing several different measures of 
performance, such as. processor utilization, 


average wait time of a request and memory traffic. 


The central idea introduced in this paper is 
that of breaking up a request for a block transfer 
into several unit requests as well as treating 
waiting requests as several unit requests for the 
purpose of the analysis. This idea makes the 
analysis of more complex cache organizations like 
write-back, write-through and load-through as easy 
as the simple cache organization. As a side 
benefit, we now have a way to evaluate the 
bandwidths of crossbar and delta networks under 
asynchronous block transfer modes. 
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AN ANALYSIS OF A NEW MEMORY SYSTEM FOR CONFLICT-FREE ACCESS — 


Tzu Yun-Kuei, Yang Shao-Tung, Yue Chang-Hai 
Department of Computer Science 
Changsha Institute of Technology 7 
Hunan, China 


Abstract--In this paper, the features 
of a new memory system are described. The 
process of conflict—free data array ac- 
cess is analyzed for interleaved memory 
system. The factors affecting the memory 
bandwidth and access speed are discussed, 
and the prime number of memory modules is 
determined for the AP-—601 computer. The 
formulas for address transformation are 
derived and the corresponding networks 
are designed. 7 | 


Introduction [1][4][5] 


In interleaved memory systems, it is 
convenient to make m, the number of mo- 
dules, a power of 2, say m= 2. Then the 
least significant p bits of every(binary) 
address immediately identify the module 
to which the address belongs. But the ac- 
cess conflicts, in this typical systen, 
are very serious under some conditions. 
In order to generate conflict-—free data | 
array access, we make m, the number of 
modules, a prime number 31/17 for the 
AP—601 computer. | 


AP-601 is a high speed vectorized com- 


puter with ten multifunction processors 
of pipeline structure operating at 20 me- 
gacycles per second. The interconnection 
network of AP-601 is a double buss struc- 
ture. The memory access time is 400 ns. 
The memory control unit consists of two 
pipelines for access, so that each cycle 
(50 ns) may read or write two words(2«72 
bits) from or to the main memory system. 
Thus the highest frequency of conflict~ 
free data array access is 40 MC/S. Each 
data array access instruction may read or 
write 128 words. A simple analysis on the 
memory bandwidth shows that the access 
speed of 17 modules (Mode 17) is 1.5 
times that of 16 modules and that of 31 
modules (Mode 31) is 1.4 times that of 32 
modules. The hardware technique used to 
generate the conflict-—free access is an 
effective address transformation unit to 
match the prime number of memory modules. 


Determination of the Number of 
Modules for the. AP-601 Computer 
In principle, the number of modules is 
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so determined as to obtain the maximum 


data array access speed required by the 
actual computer system. | 
For the interleaved main memory of m. 
modules, the address distribution is 
shown in Table 1. | 
Table 1 


m1 
m+m—4 


em+m—‘1 


ooo 
eee 


eee 


o e 
asm+2 ee6¢ ajm+m—1 


0 
1 
2 
a 


Where Nm is module's number, 
Am is the address in the module, 
and the absolute physical address is 


(1) 


The address sequence of a data array ac- 


cess instruction is 


Ao 9Antd,Agt2d, eee sAgtid, eoee 


Where ag is the first address, and d is 


the index distance. 


Suppose,due to index distance d, we 
ean only have access to m' memory modules 
among m memory modules.Here m* is called 
the effective number of modules for the 
interleaved memory. Since in the address 
sequence, the difference between any two 
neighbor addresses is d,while in a memo- 
ry module the difference between any two 
neighbor addresses is m, so that in the 
address sequence, the difference of the 
two consecutive addresses, which fall 
into the same memory module, satisfies 
the equation m'd ={m,d}, 


ive, mtx Smeg} 2 _md_ | _m_ (9) 
| qd (m,d)d (m,d) 


where {m,da} is the least common multiple 
of m and da, (m,d) is the maximum common 
factor of m and d. Since m’ equals mn, is 
the ideal case, thus from(2) we should 
have (m,d)=1. Evidently, it is impossible 
for all values of d. But,when m is equal 
to a prime number mp,we would have the 
case 


m's 1? (dekmp ) 


1 (d=kmp ) (k=1,2,3,500.) 


In order to generate conflict—free 
data array access for AP-601's double 
buss structure, we should have 


hes, 2 Ue. (3) 


tep 


where Tm is the cycle time of a memory 
module, and tep is the cycle time of an 
instruction. For the AP-601 computer, Ty 


Tee m' > 16. 

Hence, we select the prime number 31/17 
for the AP-601 computer. 

_. The average data array access speed 
S17? or the average memory bandwidth 
bi17 Of Mode 17 is compared with that of 


Mode 16 as follows: [3] 


let | 
K 2 S17 = O17 
17 7 ae 
m1 6 bn16 
since 
bat7 as mM 47 
bn16 mM" 46 


where m',.(m",¢) is the average effective 


ene of memory modules of Mode 17 (Mode 
16). 
i.e@. 


It is calculated in Table 2. Similarly, 
we have Kz, a 1.4(relative to Mode 32). 
In the mean while, there are inevitab- 


ly some random scalar data access instruc- 


tions in between or just after a data ar- 
ray access. In Mode 17, the probability 
of immediate execution of such instruc- 
tion is only 0.18 to 0.6. But it increa- 
ses with the number of memory modules. 
So we select 31 memory modules (Mode 31) 
as the normal operating mode and Mode 17 
as the auxiliary mode for the AP-601 com 
puter. 

Table 2 


6 
8 
6 
4 
6 
8 
6 
2 
6 
8 
6 
4 
6 
8 
6 
‘ 
6 
8 


The Mothod of Address Transformation 


When mp *% 2P(p=0,1,2,...), it is ne- 
cessary to find the module's number (or 
address) Nm and the address in the me- 
mory module Am by means of address tran- 
sformation. In the AP-601 computer, this 
is done in only two stages. The execu- 
tion time of each stage should not ex- 
ceed 27 ns for matching the pipeline 
processing speed. 


Address Transformation for Mode3'‘1 


1.The Formulas for Calculating Nis, 


and A 34 


In the AP-601 computer, the total ca- 
pacity of the main memory is 1984k words 
for mp = 31. So the absolute physical 
address A = 21 bits (binary), 

i.e. | 


A =a) a, ceeee cng 


- of 


let A, = Ay. | | 7 The second stage performs two kinds 
A, 22 a, a, se ad | 7 of modifications for 5S" to obtain N m3 
aes, 3 a, 8 finally. | 
3 = 86 S7 8, - Firstly, substracting "4%" «31 
lea T ] 
ae 14 04289 538442 sc a from S", oe 
A Qs cB &, 58- Tee, 3 a: 
5 = 846%17%18 19 20 | | 3 8 =8,8,8,5 ,5, +000 yeolS" 558,83, : 


In order to find N31 and A354 9 - is re- 


Because max( A,+A,+An+tA,+A, )= 1111011. 
quired to calculate | ieee, Bila Bilal) ’ 


S' has only four kinds of different sta- 


~ A. A-A A.A : | tes from which the corresponding result 
A= aT | of Nz, is obtained as shown in Table 3. 
: | | | | Table 3 
_ tAatsAgas | ‘In ..=8,8,S_8,8 
5 | [Mm31 781525354 
2° = 


25 ,=10 415 
= A jAolsA 4 Ag (2 Oe as ee See 


= A qaoAgé ata, A ghz 


A ytAgtAgtA 4atas 
31 
From (1) we have 


4n34= A, A gAzA 4 +A qhoAs +A yA +A, 


+A, AotA, 


Hence, by logic (or table look-up) we 
finally obtain 


A, tAstAstA,+AG | 
+ (4) - $, 3,8 
31 m31 ~ 3 34 35 : 

A, +tA,tAgtA, +A . : 
where elec is ie Bie is the quotient _ From Table 3, the required logic for- 
31 mulas are formed as follows: 

A, +AntAgtA +A | maids ee 
aie ic ai aie Si ae S,= FS, , 8, = FS, , 8, = FS; 
ae g F(S, © 8.) Fs, 8 
47 +t e 
A itathsthat 5), oop er Oe 
N -{j12275 45 ' a g' t 
1 = : : 
m3 | | where F =A+ S, :8, S3° Sy S5 
A, +AntAgtA +A, | | ; | 
where (iattaMotetey na thecwemanaer 3.The Practical Procedure for 
31 oO | Calculating As, 
A |thothgthyts | The first stage is to calculate _ 
of Se | "A, AnAgA,+A,AnAztA, Ath," . To save in- 
51 SO tegrated circuits, the calculation is 
divided into three sections, i.e. 
2.The Practical Procedure for Ay Ag! Ag! A M4 
Calculating N ¥ 
culating Nis, | "A, lo! A 4 
| jt Aye ‘AS 
The first stage is to calculate | a | - 1 Pe 
wo ‘ace tantatne " vs oats I 
S =A, tAntAgtA grag=m 28) 35938455 ° 
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where Ag+Ag=H oH, HoH SH Hs has been obtain- 
ed from the process for calculating N,, ° 
The first section is A, A5=858 4858 3 & 425° 


The second section is to do 


ul 


843 a Bas a 


847 848 849 420 


Ay 
As 


The process of address transformation 
for Mode 17 is much more complicated 
than that for Mode 31. In order to use 


H, Hy H, Hs H, H the same stages as those used for ara 
0 4° we put Ap=0, and directly take A,A,AzA 
+ a 2 2-574 
| : = mT ( teed by 
x o# | ; Since Ap=89=0,(it is guaranteed by 
a, a5 H, Ho Hs Hy Hs programming), we have 
The third section is to do = A, A, As Ay Ag 
BQ By 4 A, nA, zOq 
0 711742°13"14°15 A, Ay Ag Ay Ag 
+ &o SS 
of + | 
BG BF 4 484 294 594 4945 
The second stage is to do the final cal- = -8 ~12 
culation of Ais, , =A ywewe RO 752" 42" Hea) 
: , f ] we "¢ if ] e 
26 a, Ay as a, a5 H, Hy Hs Hy He Bg 4 4 084 584 2845 
| OO - a* | : 
0 0 0 0 a4 ac 0 0 0O ay at 9° 0 0 a&, ad2 
+ | F 
ee ee ee ee ee Pe ee es ee ee ee ee 
By Oy 82 As Aq AH BG B87 9B Bq 940%4194294591415 


where "oO4+F" is the quotient of 
A, +A, +A- +A, +A 
: 2 3 4 2 which has been 


3 31 
obtained from calculating Nis, - 


Address Transformation for Mode 17 


1.The Formulas for Na7 and = m17 


In this case, the total capacity of 
the main memory is 1088k words, so we 
have Similarly 


A a 


0 


a4 a5 @#e2eees 850 


Let A a 


o "0 | 

A,= Qa, Ao as ay 
A5= as Ag aq ag 
Ag= Ag 840944942 


= A shod 3 A,~A, A Ast, Ay-A, 


A,-AntAs-A, tas 
- 
17 


As before, Nat7 is equal to the remain- 


der of 
A, -A, tA “Ay +Ag 
17 
i.@. me hae +A 


career cea | 


A, +A +A, +34-Ao-Ay 


(s 17 } 


a aaa ana 


321 


47 
is ae er ae a 


7 1 | 


where Ant Ay is the reverse code of 
(A, + A,)- 


17 


2.The Practical Procedure for 
Calculating N m17 
In the first stage we calculate 
a > carw ws 
S" = A, tAstAp+Aat+A +3 
= Kol oS" S39 SF. 


In the second stage we produce two kinds 
of modifications for 5S". Firstly, sub- 
stracting "hy ky AZ"#17 from S", we get 


qt 32 og gu gu gn 


1 85 Sg Sy + 11) 06 of + 1 
/ ' t ’ . 
= ot’ 5, 55 8 34 ’ e 


Because max(A,-Aj+Az—A,+Ap+34)=1001111, 


S' has only five kinds of different sta- 
tes from which the corresponding result 
of Niat7 is obtained as shown in Table 4. 


Hence, by logic (or table look-up), we 
obtain a 7 
Nai7 =X54 So S354 

Table 4 
t_)’atararar a 
| state | S'=0l'S{S38484 |W 4 7=XS, 85558, 
| e q t ' t | t 
0 8183835) 0 S193835) 


From Table 4, the required logic formulas 
are formed as follows 


ol = X'S 35} : Sy=e' S35 3) +XS! 
5 =0835 94 +075) 

8. = (S83 @ 94) +X 5 

8, = HK @S) 


a22 


Thus the simplification of address 
transformation for Mode 17 is the com- 
pensation for 64kw reduction of memory 
capacity. In other words, we exchange 
64kw memory capacity for the feature of © 
conflict-—free access. © | | 


The Memory Control Unit 


of AP-601 Computer [2](6 ]([7][8] 


As shown in Fig.1, the memory control 
unit of the AP-601 computer is used to 
organize the configuration of Mode 17/ 
Mode 31, arrange the accesses of multi- 
requests to be executed, and process the 
compression and restoration of vector 


access, indirect address, pipelines’ 


chaining, etc. . 

It provides double busses for data 
array access, and a single bus for I/0 
and scalar access. In order to produce. 
address transformations continuously, 
there are two address forming pipelines 
consisting of four segments each. The 
former two segments are used to form the 
absolute address and the latter two are 
used to produce address transformations 
(Fig.2, Fig.3). To match the address 
forming pipelines, there are another two 
pipelines for data written in, and two for 
data read out. The control signal genera- 
tor is also a pipelined unit for synchro- 
nizing the operation of the whole memory 
control system. 

In short, all the main memory system 
including the memory control unit is 
fully pipelined and equipped with self- 
testing and self-checking circuits. 

The logic diagrams shown in Fig.2 and 
Fig.3 are separately designed from the | 
formulas which have been derived for cal- 
culating N., and A,2in the above sections. 


Conelusion 


As mentioned above, the ratio of the 
average speed of Mode 17 to that of Mode 
16 is Kyq@ied- Similarly, we have 


Kz,~ 1.4. Since the quantity of integra- 


ted circuits used for address transfor- 
mation may be neglected in comparison 
with that of the whole memory system,we 
get a 50% increase in access speed with 
1/16 increase in memory hardware for 
Mode 17. And we get a 40% increase in 
access speed with 1/32 decrease in me- 
mory hardware and 64kw decrease in me— 
mory capacity. 

Since the new memory system is fully 
pipelined, it results in very high data 
array access speed and is operated with 
very high efficiency. | 
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MODELING OF SHARED-RESOURCE SYSTEMS USING 


THE CENTRAL-SERVER QUEUEING MODEL 


N. C. Strole 
IBM Corporation 
Research Triangle Park, N. C. 


SUMMARY 


This paper introduces an efficient analytic 
method based on the central-server queueing model 
for establishing the initial design parameters for 
the simulation analysis of shared-resource: com- 
puter architectures. The analytic model, origin- 
ally designed for the optimization of multipro- 
gramming systems, is shown to be applicable to the 
cost-throughput optimization of a shared-resource, 
multiprocessor system specifically designed for 
concurrent execution of tasks within a single job. 
Results from the analytic model are corroborated 
via simulation techniques with the Duke University 
Shared-Resource Simulator (DUSRS). 


Background. One of the major goals in simulating 
shared-resource computer architectures is to de- 
termine an optimum resource configuration for hand- 
ling a specified workload. An optimum configura- 
tion for a system may be that combination of re- 
sources that maximizes system throughput for a 
given workload, or possibly a system configuration 
that suffers the least degredation in throughput 
due to faults in system components (fault- 
tolerant system). However, many system designers 
are interested in maximizing the system through- 
put for a given system cost [4]: An analytic 
technique based on the central-server queueing 
model, first presented by Trivedi and Wagner [9], 
is shown in this paper to be applicable to the 
cost-throughput optimization of a shared-resource, 
multiprocessor system specifically designed for 
concurrent execution of tasks within a single job. 
This model is demonstrated to be an efficient 
.method for establishing the initial design para- 
meters required by the Duke University Shared- 
Resource Simulator (DUSRS) [8] for the simulation 
of shared-resource computer architectures. 


An analytic model for maximizing system 
throughput in a multiprogramming environment by 
proper choice of device speeds, subject toa 
system cost constraint, was shown to be applicable 
to closed queueing networks and proven to yield a 
global optimum to the optimization problem [9]. 
The method of Lagrange multipliers was implemented 
in a Fortran program [5] to transform the problem 
into a system of nonlinear equations that were 
solved by employing an existing system subroutine 
package. Workload parameters, branching proba- 
bilities, and device costs were factored into the 
analysis over a range of system costs with varying 
degrees of multiprogramming, resulting in relative 


device utilizations, device speeds, and job through- 


put rates for each system budget and degree of 
multiprogramming. A thorough development of this 
analytic model can be found in [10]. 


-Of parallelism within the job. 


P. N. Marinos 
Dept. of Electrical Engineering 
Duke University 


Modeling Parallelism. The active resources ina 
multiprogrammed system, such as the bus and I/O 
channels, can be represented by a central-server 
queueing model having mt+l service centers 

(Figure 1). The ith service center consists of a 
single server and a queue for temporarily holding 
tasks waiting for service. The classical usage of 
the central-server queueing model assumes that the 
degree of multiprogramming, n, represents the 
number of concurrent jobs. Each job can be in 
only one of the servers or queues at any given 
time, and therefore, an individual job can be rep- 
resented as a series of tasks, each requiring a 
particular system resource and a certain execu- 
tion time for its completion. The tasks are 
serial in nature in that they pass from one server 
to the next during execution, thus not allowing 
task concurrency within a single job. All tasks 
must pass through the central server (CPU) before 
branching to one of the I/O servers with a fixed 
branching probability P. » £0; lj ascsag lls 


To accurately model shared-resource systems, 
task concurrency within a single job must be rep- 
resented. A precedence relation on the set of 
tasks, T, present in a job can be established by 
the partial ordering relation, R, such that tRt' 
whenever task t completes before t' can begin. 
Tasks t and t' are independent if they are non R- 
related. A directed path of tasks of length k may 
be defined as (x, x5) (x2 x3)... (%~-1%) where 
x; denotes a node (or task) at the ith level in the 
path. In such a directed path, for i and j such 
that lsi<jsk, x, is a successor of x; and xj is a 
predecessor of je A terminal task is one with no 
successor, while one with no predecessors is an 
initial task[3]. 


The maximum length, kmax, Of any path within 
the job, represents the number of potential levels 
Assuming a total 
of N tasks within a job, the degree of parallelism, 
D, for the job (i.e., the number of parallel tasks 
per level) is defined to be: 


D = N/K ax (1) 


Thus, D represents the mean number of concur- 
rent active tasks within the job at any given time. 
For example, the hypothetical precedence graph 
shown in Figure 2 is composed of 20 tasks with a 
maximum path length of 5, resulting in a degree of 
parallelism, D=4. 


The central-server model in Figure 1 can be 


. adapted to model a single job having a fixed num- 


ber of concurrent tasks, D, as shown in Figure 3. 
The CPU becomes the master instruction processor | 
(MASTER); the m I/O devices become m distinct 
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classes of functional units, each represented by 
a single equivalent server (FUNIT); and D 
becomes the degree of multiprogramming. When- 
ever a task leaves the new job path, it is 
assumed that this represents the termination of 
the present job, which is immediately replaced 
by a new job having the Same mean number of con- 
current tasks, D. 


The N tasks which comprise the job can be 
partitioned into g sets, Q., i=l, 2peceeerGy 
where the tasks in each at Qs require a type-i 
functional resource from the system pool of g 
resource classes. It is assumed that one addi- 
tional task is present in the system that does not 
branch to one of the functional units upon com- 
pletion at the MASTER to represent the termina- 
tion of one job and the entry of a new job along 
the new program path. We define Po as the new 
job path branching probability and it is given by: 


Py = 


1/ (N+1) (2) 
The probability, Pi, that any given task will 
require a type-i device is given by: 
lo. | / (N+1) =1,2,.+-9, (3) 
where |Q, | denotes the cardinality of set Q,. 
Also, the sum of ail P's must be such that: 


(4) 


The probability, Pit that a task within the job 
will. require service at the ith server is assumed 
to be identical to the branching probability, p., 
in the central-server model shown in Figure 1. + 
Thus, the branching probabilities for the central- 
server model in Figure 3 are defined by equation 


(3). 


It is now conjectured that one may equate the 
degree of parallelism, D, within a job to the 
degree of multiprogramming in the optimization 
problem, and thus use the analytic program to 
determine an optimum set of device speeds for a 
given budget. Branching probabilities are deter- 
mined by the distribution of task types Peneug awn 
the job as described above. 


Several restrictions must be applied to the 
optimization problem to permit a closed-form 
solution; that is, 


1. Task execution times are assumed to be 
exponentially distributed. 

2.. A task can occupy only one server at a time. 

All tasks must pass through the MASTER upon 

leaving a FUNIT. 

4. The number of concurrent tasks must remain 
constant. 


The assumption that a task can occupy only 
one server :or FUNIT at a time is not always true, 
since actual shared-resource systems permit tasks 
to simultaneously access multiple resources, such 
as memory, buses, and processing elements. 
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to implicitly define the hardware interconnection 


Therefore, analytic techniques are limited in 
their ability to accurately model delays due to 
resource contention. The analytic model is 
thus seen as a possible tool for establishing 


some general system parameters for use in a more 


representative simulation model of a shared- 
resource system. 


The Duke University Shared-Resource Simu- 
lator (DUSRS) has been developed as a research 
tool for the evaluation of shared-resource com- 
puter architectures and parallel program 
schemata [1]. The user can define a pool of m 
system resource classes to be shared concurrent- 
ly by one or more instruction stream processors, 
each with an optional multiprogramming capa- 
bility. Workload data can be generated by a 
separate program facility that allows the user 


and control structure of the system, along with 
the process tasks to be executed {7}. The 

simulation output gives the resulting job execu- 
tion times and functional resource utilizations. 


The basic hypothesis that the degree of 
parallelism, D, within a single job can replace 
the. degree of multiprogramming, n, in a central- 
server queueing model for the cost-throughput 
optimization of a shared-resource system, was 
validated by comparing the results from various 
runs of the analytic model with the results from 
comparable runs of the DUSRS simulation pro- 
gram [7]. Characteristic workloads, consisting 
of 10 jobs with degrees of parallelism, D, of 2, 
4, and 8, were evaluated by both the analytic 
and the simulation models for a system consist- 
ing of one MASTER, one PROCESSOR, and one I/O 
CHANNEL. The 95% confidence interval for job 
execution time was computed (Table 1) as described 
in [6] and plotted with the analytic results— 
(Table 2) as a function of system cost. _ 


The graphical results for D=8 (Figure 4) show 
that the mean job execution times as determined 
by the analytic model fall within the 95% con- 
fidence interval established from the DUSRS 
simulation results. In addition, the device 
utilizations from the two methods are in very 
close agreement as shown in Table 3. Thus, the 
hypothesis that the degree of parallelism, D, with- 
in a job can replace the degree of multiprogram- 
ming, n, in a central-server queueing model for 
the cost-throughput optimization of a shared- 
resource system is shown to be valid, and the 
device speeds produced by the analytic model 

can be interpreted as the effective speed of the 
FUNIT classes, comprised of an unknown number of 
functional units. However, the simulation runs 
were implemented so as to accurately reflect the 
queueing model restrictions of allowing only one 
server (FUNIT) of each type. Thus, the true 
parallelism within the workloads that would have 
allowed multiple PROCESS and IOTASKS to occur 
simultaneously was not simulated. 


The mean number of tasks in each queue in a 
FUNIT class can be estimated using a technique 
known as mean value analysis [2]. We can assume 
that this represents the average number of 


concurrent tasks within a job requiring access to [3] BE. G. Coffman and P. J. Denning, Operating 


each FUNIT class. A further simulation experiment Systems Theory, Prentice-Hall, (1973). 

was conducted in which the number of FUNITs in 

each class (except MASTER) was increased to this [4] D. Ferrari, Computer Systems Performance 
number to allow concurrent task execution, with no Evaluation, Academic Press, (1972). 

changes to the workload. The device speeds were | 

reduced so that the effective speed of each device [5] R. E. Kinicki, Queuing Models for Computer 
class was comparable to that of a faster single System Configuration Planning, Ph.D Disserta- 
device. The 95% confidence interval established tion, Duke University, (1978). 

from this experiment (Table 4) also enclosed the | 

analytic results as plotted in Figure 4. [6] H. Kobayashi, Modeling and Analysis, Addison- 


. Wesley, (1978). 
Conclusion. In this paper, an analytic technique 


has been described for determining a set of cost- [7] N. Cc. Strole, Simulation Facility for Per- 
throughput optimized parameters for use in a more formance Evaluation of Shared-Resource Computer 
realistic and detailed simulation model of a shared- Architectures, Ph.D Dissertation, Duke Univer- 
resource system for executing jobs containing sity, (1980). 
potential parallelism among tasks. This technique 
was validated using independent simulation exper- [8] N. Cc. Strole and P. N. Marinos, "A Shared- 
iments for a class of systems that were assumed. Resource Simulation Facility", Proceedings of 
to be representable by a central-server closed 1979 Johns Hopkins Conference on Information 
queueing model. Sciences and Systems, pp. 85-90. 
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TABLE 1 - DUSRS 95% CONFIDENCE INTERVAL- DEGREE=8 


J B -RUNTIM™ME CSEC) 


NEW JOB PATH 


1.030 0.722 0.462 
0.096 | 0.067 0.042 


MASTER 


FUNIT g 


FIGURE 3 
CENTRAL~SERVER MODEL FOR SHARED-RESOURCE SYSTEM 
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ANALYTIC 
pusrs TABLE 2 —- ANALYTIC RESULTS 


RUNTIME (CSEC) 


main + 


1.047 | 0.727 0.465 


HouUzagnmMYW 


10K 11K 12K 13K 14K 15K 
SYSTEM COST ($) 


FIGURE 4 - 95% CONFIDENCE INTERVAL» DEGREE=8 


TABLE 4 
TABLE 3 ~ ANALYTIC VS DUSRS UTILIZATION RESULTS 
DUSRS 95% CONFIDENCE INTERVAL WITH MULTIPLE 

DEVICES - DEGREE=8 


DEVICE ANALYTIC DUSRS 


JOB RUNTIME (SEC) 
TYPE ACTUAL ACTUAL 


MASTER . (36.0% BUDGT-—> $10K $12K © $13K S$15K 
PROCSR . 61.7% 
I70 ° 53.3% 


MASTER 52.3% 
PROCSR | 75.9% 
I70 1 3 68.5% 


| 9.737 0.631 0.472 
0.080 0.064 0.050 


MASTER 69.6% 
PROCSR 85.6% 


I70 . 81.8% ; ABOVE RUNS MADE WITH 3 PROCESSORS AND 3 I/0 CHANNELS 
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APPROXIMATE MODELS FOR MULTIPLE BUS 


MULTIPROCESSOR SYSTEMS 


M. Ajmone Marsan 


Istituto di Elettronica e Telecomunicazioni 
Politecnico di Torino - Italy 


ABSTRACT Markovian models are developed for the 
performance analysis of multiprocessor systems in- 
tercommunicating via a set of busses. The per- 
formance index is the average number of active 
processors, called processing power. The computa- 
tional complexity of the exact models increases 
very rapidly with system size, thus making the 
exact analysis impractical even for medium size 
systems. To overcome the complexity of computati- 
on, several approximate models are introduced. 
The approximate results are compared with the 
exact ones and found to be surprisingly accurate 
for a wide range of configurations. 


1. INTRODUCTION 


Early multiprocessor systems were develop- 
ed using crossbar networks to connect processors 
and memories /1-5/. With the availability of inex- 
pensive microprocessors, multiprocessor systems 
with a very large number of components are now 
becoming feasible and cost effective. For such 
systems a crossbar interconnection network may be 
intolerably expensive and in general it would 
provide a bandwidth much higher than needed. A 
more attractive alternative is represented by bus 
oriented interconnection networks. Single or mul- 
tiple bus architectures can be used, according to 
the bandwidth required for the specific applicati- 
on. These interconnection networks are generally 
called "multiple-bus" or "highway deficient" /5/ 
networks. Some papers addressing the analysis of 
bus systems appeared very recently in the litera- 
ture /5-7/. 

This study considers multiple processor 
systems that exchange information through a com- 
mon memory which consists of several modules. 
Processors and common memory modules are connect-— 
ed by a set of "global busses". Each global bus 
can connect any processor to any memory module. 
Every processor is also connected (and has exclu- 
sive access) to a private memory. We indicate a 
multiprocessor system with p processors, m common 
memories and b busses with the notation pxmxb. 
The block diagram of a 3x3x2 system is shown in 
fig. 1. 


This work was performed while M. Ajmone Marsan was 
vistting the UCLA Computer Setence Department. The 
research was supported in part by ONR under con-- 
tract NOOO14-79-C-0866 and tn part by NATO. 
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GLOBAL BUSSES 


PROCESSORS WITH PRIVATE 
MEMORIES 


Fig. 1 - Block diagram of a 3x3x2 system. 


The exchange of information is accomplish- 
ed by fist writing the information in the appro- 
priate common memory module and then reading it 
from the destination processor. Due to the shar- 
ing of both memory modules and busses, contention 
may arise, causing processors to queue for a 
resource which is currently in use. If the number 
of busses b is greater or equal to the smaller 
between the number of processors p and the number 
of memories m, i.e. b>min(m,p), then the conten- 
tion is only caused by the charing of memory 
modules. 

Multiple processor systems for which the 
inequality holds are usually known as "crossbar" 
architectures. Multiple processor systems for 
which the inequality does not hold are usually 
called "highway deficient" systems or "multiple 
bus" architecture. For these systems we assume 
throughout this paper that py)m)b. The case m)p 
can be analyzed using the same techniques descri- 
bed here; the models are generally simpler than 
those presented in this paper. 

It is possible to construct a queueing 
network model for the analysis of both types of 
systems. The general case is shown in fig. 2. 
Processors join memory queues, and before proceed- 
ing to service (i.e. accessing memory) they must 
be granted a permit (bus). The permit is returned 
upon completion of service. The general model is 


thus a closed queueing network with p classes of 


customers and with passive resources /8,9/, which 
in this case represent the busses. In the case of 
crossbar architectures the presence of busses can 
be ignored, thus making the analysis substantial- 
ly simpler than for multiple bus systems. 

A processor can be in one of three diffe- 
rent states: 


ae \/ 
; iz 

DO} 
e cet 


© 
Fig. 2 - Closed queueing network model. 


The processor can execute in its 
vate memory 
The processor can exchange data 
other cooperating processors, by 
ding from, or writing into the 
mon memory modules. 
The processor can be waiting to ac- 
‘cess a common memory module. 

We say that a processor is ACTIVE when it 
is in state (1), and the goal of our analysis is 
to determine the average number of active proces- 
sors, P, called processing power. 

P is the performance index considered in 
the sequel. Other important performance measures 
are Simply related to P. 


(1) 


(2) 
rea- 
com— 


(3) 


The assumptions we make separating the ope- 
ration of the system are similar to those found 
in the literature on crossbar systems. 


The model we consider is thus soentetely 
symmetric with respect to. processors and memori- 


- es. These symmetries are not necessary to obtain 


pri- > 


with > 


‘Each processing unit is active for some 


time while the CPU is executing a program that 
only requires accesses to its own private memory; 
the duration of these activity periods is an 
exponentially distributed random variable with 
the same parameter 4 for all processors. At the 
end of an activity period processors generate 
access requests directed to a specific memory, 
chosen at random among: the external common memory 
modules; each memory vis requested with the same 
probability 1/m. If a bus.is available and the 
requested memory is free, the processor. accesses 
it for: an exponentially distributed period with 
parameter; @, the same for all processors and 
-memories. { 


If either no bus is available or the 
requestedi. memory is. busy, the processor idles 
waiting: for the necessary resources. At the end 


of an-.access the processor begins a new activity 
period; bus and memory are released and canbe 
-accessed:'by other processors. .An. arbitration me- 
chanism for the assignment of the bus is assumed, 
that randomly chooses among the heads of the 
nonempty queues referencing free. memory modules. 
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a Markovian model, but allow some reductions in 
the size of the resulting: Markov Chain. 


With the above assumptions — we can con- 
struct a Markov chain to model the behavior of 
the system. Using the theory of "Lumpablie" Markov 
chains /10/, we can reduce the number of states 
of the chain /11/. 

~The state definition for the exact. oases 
chain is: | 


| (n_, a,» Ginn (1) 
where x | 
nS is the number of processors currently 


accessing a common memory 


q yr dy are the numbers of ‘processors 
queueing for the memories currently acces— 
sed, arranged in decreasing order 


Ve pants qQ_ are the numbers of processors 
queueing. for a free memory, not accessib— 
le because no bus is aves table; arranged 
in decreasing order. | 


The general pxmxb case is not easy to handle, 
even after lumping is applied. We will therefore 
introduce in the next section some approximations 
which further reduce the size of the Markov chain 
and permit us to attack the most general case. 


2. APPROXIMATE MODELS | 


The reason for the introduction of appro- . 
ximate Markovian models is that, for general mul- 
tibus systems, the number of states increases 
very rapidly with system size. The explosive grow- 
th is due to the detailed information that the 
states must record about the queues inside the 
system. In particular for each state of the Mar- 
kov chain the number of customers queued for all 
common memory modules must be recorded. That is, 
we not only need to know the number of the queued 
customers, but also must be concerned with all 
the possible ways of distributing these customers 
among the system queues. If we reduce the amount 
of information about the status of the queues we 
have no longer a first order Markov chain beha- 
vior in the evolution of the system through the 
state space. The approximate Markov models that 
we introduce in this section analyze the system 
behavior by assuming that the transitions between 
the states with reduced queueing information 
still satisfy the Markov property. The results 
that we will obtain in this way are approximate 
and must then be compared uO: the exact ones to 
test their accuracy. 

In order to define a astieies model, 


one needs to specify: 


a) the state definition, that is the 
amount of information used to describe the state 
of the Markov chain. As was mentioned before we 
will use reduced information about the queues in 
the system. . 

b) the method to calculate the transiti- 
on rates for the simplified Markov model. As the 
behavior is approximated by the simplified Markov 
chain the transition rates must be evaluated ac-— 
cording to some empirical rule, and several diffe- 
rent rules can be envisioned. | 
| Three different state definitions (named 
A, B and C) and two heuristic methods for the 
evaluation of the transition rates (named 1 and 
2) were considered. The approximate models are 
named using the letter referring to the state 
description and the number referring to the evalu- 
ation of the transition rates. 


Model Al - The. state of the system is 
represented by the total number of proces- 
sors waiting either for a busy memory or 
for a busy bus, n , and hy the number of 
processors currently accessing a common 


memory module, a We thus have a pair 
(n,n) © (2) 
m q 
The transition rates are evaluated by as- 
suming that each queued processor requests, with 
uniform probability, any of the common’ memory 
modules currently not accessible (this approxima-— 
tion implies that a queued processor can randomly 
reselect a new memory when a memory or bus beco- 
mes unblocked). 
Next we introduce a modification of model 


Al, by specifying a different method for the 
calculation of the transition rates: 


Model A2 = The state of the system is 
defined as in model Al). The transition 
rates are evaluated using an "averaging" 
technique. 


In order to evaluate the new transition 
rates between two macrostates, we count the num- 
ber of states that we merge into a macrostate, 
add all rates from the merged states to each 
neighboring macrostate, and define as transition 
rates the ratio between the sum of transition 
rates and the number of states merged. 

We now consider another definition of sys- 
tem state (yet retaining the rate computation 
rule of model A2): 


Model B2 - The state of the system is 
represented by the following triplet: (1) 
the number of processors accessing a com— 
mon memory module; (2) the total number 
of processors waiting either for a busy 
memory or for a busy bus; and (3) a flag 
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which is set to zero when no processor is 
queued for a bus, and is set to one when 
one or more processors are queued for a 
bus in order to access a 
memory module. . 
The transition rates are evaluated using 
the averaging technique described in the 
approximation A2. 


free common 


Clearly, model B2 is a refinement of A2, 
since the state is improved by adding a binary 
information concerning the system queues. 

All the preceding approximate models lack 
of one feature which is very desirable in all 
analytic models: namely, a closed form solution. 
We introduce here the simplest possible model 
which provides us with a closed form solution. 


Model C2 -— The system state is simply the 
number of active processors: no account 
is kept of the state of internal queues. 
The transition rates are evaluated using 
the averaging technique. 


We have reduced the system description to 
a birth and death Markov chain, whose solution is 
easily obtained. 


3. RESULTS 

We compare exact and approximate analytic 
results by considering a 6x4x2 system. Results 
are presented in Table 1. The first column gives 
the value of g =A /p » the second column shows 
the exact value of processing power as a function 
of 9 , evaluated from the exact lumped chain. The 
other columns show the percentage error that af- 
fects the processing power value computed with 
each of the four approximations introduced. For 
this case the exact chain has 37 states, whereas 
the approximate chains have 12, 12, 16 and 7 
states, respectively. 

Approximations Al, A2 and B2 seem _ to 
yield upper bownds on the processing power, where- 
as C2 gives a lower bound. The upper bound can be 
intuitively explained for approximation Al, since 
the random redistributing of processors to memo-— 
ries tends to relieve congestion and thus to 
improve performance. The bounds seem to be rather 
tight, 
were typically observed. 


since percentage errors well below 10% 

A 16-processor, 3-bus 
was Simulated, in order to test the accuracy of 
the approximate models for large system size. 
Results in table 2. 
Markov chains of models Al and C2, having 46 and 
17 states respectively, were solved. The results 
show that the approximate models behave very well 
for a system of this size; indeed, the approxima- 
te results are so close to the simulation results 
that in most cases they fall within the 99.9% 
confidence interval. since the system 


8—-memory , system 


are shown The approximate 


Moreover, 


of linear equation associated with the approxima— 
te Markov chain can be easily solved with numeri- 
cal methods, the approximate models. require much 
less CORDELEE time than a simulation program. 
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Table 1- Exact results and percentage 
errors for the 6x4x2 system. 


Table 2 - Simulation and approximate 
results for the 16x8x3 system. 


THE ANALYSIS OF A DECENTRALIZED CONTROL ALGORITHM | 
FOR JOB SCHEDULING UTILIZING BAYESIAN DECISION THEORY 
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Abstract -- The principles of Bayesian deci- 
sion theory can be applied as a systematic approach 
to complex decision making under conditions of im- 
perfect knowledge. Decentralized control of indi- 
vidual functions of a distributed processing sys- 
tem seems to be an especially relevant area for 
application of Bayesian decision theory. This 
paper formulates the distributed processing job 
scheduling problem in Bayesian decision theory 
terms, solves the problem for a particular algor- 
ithm, and discusses the usefulness, limitations 
and open questions regarding this approach. 


Introduction 


Many of the potential advantages of a "cooper- 
ative" distributed processing system depend on the 
ability to develop effective decentralized algor- 
ithms for the executive control functions of the 
system, e.g., communication and scheduling. Pro- 
posed decentralized algorithms must be analyzed 
for performance (e.g., average delay of messages 
for the communication function or throughput for 
the scheduling function), for logical correctness, 
reliability, overhead costs, stability, fairness, 
extensibility, cost and difficulty of initializa- 
tion, understandability, and how well the algor- 
ithms meet the specifications [1,2]. Even though 
many of these criteria are subjective, both the 
criteria themselves and their interactions must be 
addressed as best as possible by system designers. 
A further complication to the analysis is due to 
the conditions of imperfect knowledge that exist 
in a distributed system [1,3]. What is required 
is a methodology that addresses these criteria, 
their interactions, and the conditions of imper- 
fect knowledge. 


To date, mathematical programming, queueing 
theory and control theory have been applied to the 
analysis of decentralized control problems. Solu- 
tions using these disciplines are limited in some 
or all of the following ways: they are static op- 
timization problems, require large amounts of com- 
putation, depend on accurate knowledge of the cur- 
rent state of the system, do not include details 
of the algorithms, and have limited or no potential 
for incorporating the other important issues of an 
algorithm's effectiveness listed above. 


Our approach to the development and analysis 
of decentralized control algorithms is to use 
Bayesian decision theory [4,5,6,7,8,9]. This 
theory directly addresses decision making under 
uncertainty, and has the ability to incorporate many 
complex factors and their interactions via the 
utility function, e.g., 
Furthermore, because of the systematic approach of 
Bayesian decision theory, designers of distributed 
systems can apply the principles of Bayesian deci- 
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of the following actions A = 


reliability and performance. 


01003 


sion theory as a methodology for algorithm design 
and evaluation as this paper illustrates. 


This paper presents an example that illustrates 
the use of Bayesian decision theory in developing 
and analyzing a decentralized job scheduling al- 
gorithm. Several variations of the basic example 
are also presented. The examples are simple 
enough to understand yet involved enough to illus- 
trate the practical use of decision peony to 
evaluate job scheduling decisions. 


Decision Theory Under Uncertainty 


There are seven essential steps in the form- 
ulation and solution of a decision problem under 
uncertainty. A particular formulation and solu- 
tion is called Bayesian decision theory. In the 
following sections each of these steps is ex- 
plained for a decentralized job scheduling algor- 
ithm. We then explain how to make use of the 
evaluation by incorporating the analytical results 
into the scheduling algorithm. The algorithm it- 
self is described in a piecemeal fashion through- 
out. the seven steps of the problem formulation. 


Step One—Actions 


The job scheduling function in a distributed 
processing system has two tasks: schedule jobs to 
run on hosts, and move jobs between hosts to bal- 
ance the system load. The job scheduling function 
is implemented by n controllers (decision makers), 
one controller running on each host. For illustra- 
tive purposes assume n = 5. Each controller 
schedules on a FCFS basis when its host becomes 
available. There is no decision strategy needed 
for choosing the next job to run. -Hence, this 
action of the scheduler need not enter into the 
decision problem. On the other hand, in order to ~ 
balance the load each controller i may ree one 
{a, }j=0; 


ay: move no jobs 

a,: move one job from queue i to host 1 

a5: move one job from queue i to host 2 

az: move one job from queue i to host 3 

ay: move one job from queue i to host 4 

ag: move one job from queue 7 to host. 5 
Moving a job to oneself is considered as moving 0 
jobs, hence ay is not really required. It is used 


in this example simply to emphasize the action of 
moving no jobs independently of the controller 
that is being considered. When activated, a con- 
troller continues to iterate until a) it decides 
to move 0 jobs , or b) it has decided to move 


enough jobs to satisfy itself that it has done its 


share in balancing the load at this instance of © 


time. 


Part (b) must take stability into account, 


i.e., all controllers should not dump all of their 
excess jobs into a lightly loaded host because | 
that will most likely cause an unstable system. 


The actual movement of the jobs will -be done asyn- | 
chronously with the scheduler. 


be moved. 


~The time required 
for the movement is dependent on the length of the 
jobs, traffic of the network and the distance to 


Periodically, state information is 


passed around the system in a manner similar to 
how ARPANET routing information is passed [10,11]. 
The scheduling algorithm itself is assumed to com- 
plete before the next state information update 
This: assumption is reasonable 
Since the algorithm essentially just performs a 


message arrives. 


table lookup. 


Note, that for more complicated sit- 


uations each controller might have a different set 
of possible actions, or move clusters of jobs at 


once. 


Step Two—States of Nature 


~ As the second step of a decision: problem, the 
actual states of nature that could occur are speci- 


fied. 


fined in terms of how busy a host is. 


Such actions are simply added to the set of 
possible actions. 


For scheduling, the states of nature are de- 
In this ex- 


ample, this quantity is measured in terms of the 
number of jobs in the queue. 
quantification of the busy estimate can be based . 
on many factors including, for example, the esti- 
mated requirements of the jobs in the queue [12]. 
However, the state information used to derive the 
busy factor is independent of the use of that busy 


factor in the decision theory context. 
of nature are defined as 6 = 189» Oy) >+- 


More complicated 


The states 
815} where: 


~ above. 


2. Furthermore, in a real system this defini- 
tion of 6 is probably not reasonable because it 


does not distinguish between the following cases. 
Suppose host 1 through host 4 have 0 jobs and host 


5 has 8 jobs, versus host 1 through host 4 with 
1000 jobs and host 5 with 1008 jobs. In the 


- latter case job movement is probably not warranted. 


Another possibility is to define 6 relative to the 
local host's busyness. We continue with the above 
definition of ] for’ ease of explanation. 


‘Note, that due to the absence of uniqueness 
in both time and space [3] in a distributed system, 
0. cannot be known precisely. Hence, what is re- 


quired is the probability of a given state occur- 
ring, P(e, ), i = 0,...,15 and this is best esti- 


mated by ieesuponene: or simulation. The very 
nature of a distributed system requires that pro- 
babilities be utilized in the analysis rather than 
assuming accurate state information. Fortunately, 
decision theory under uncertainty satisfies this 
criterion for analysis. 


Step Three—Utility Function 


The function u(0, a.) that assigns. gains ie 
action a j=0,...,5 for each state 
-515 is called a utility func- 


tion. A difficult and subjective aspect of deci- 
sion theory under uncertainty is specifying the 
utility function. On the other hand, very complex 
interrelationships and situations can be subsumed 
within the specification of the utility function © 
allowing, for example, the evaluation of both re- 
liability and performance simultaneously. The use- 
fulness of applying decision theory to decentral- 
ized control analysis revolves around the utility 


losses) to each 
of nature Oa i=0,.. 


Gg: no hosts are least. busy : function. That is, if the derived utility func- 
84: host 1 is least busy by 1-2 jobs tions are accurate in predicting practical deci- 
. . | 9 4 sions, then this theory will prove of tremendous 
ve! nose “ Hegs tpusy ey | é done help to the development and analysis of decentral- 
83: host 3 is least busy by 1-2 jobs ized control algorithms. A hypothesis of this 
04: host 4 is least busy by 1-2 jobs paper Pai since there are methodological tech- 
= aa . 9 4 niques for deriving utility functions (even 
eG: host 5 = least busy by I-2 Jee 7 ; though the values are subjective), it is possible 
66: host I is least busy by 3-5 jobs inclusive to derive an effective utility function for the 
: : _e 3 eeTie 4 scheduling function. At a minimum, the derivation 
og NOSE - Peas -DUSyeey a2 Jes pita of the utility function should provide a good meth- 
8g: host 3 is least busy by 3-5 jobs inclusive odology for addressing complex interactions and un- 
8g: host 4 is least busy by 3-5 jobs inclusive certainty existing in distributed systems. | 
819: host 5 is least busy by 3-5 jobs inclusive The utility function u(o. 4295) can be expressed 
844: host 1 is lTeasy busy by >5 jobs in table format. Table 1 is the utility function 
: 94 : derived for our job scheduling example. We now 
29° Nest : Pecuees BRtSy icy = she / briefly describe how the values in the table were 
843: host 3 is least busy by >5 jobs chosen. | 7 | ae F 
ae Deters Use Vede Tua ya? ae It was decided to use a scale from 0 to 100 to 
845° host 5 is least busy by >5 jobs quantify utility. If the state of nature is 89 


Obviously, the 
The parameters 


set @ can be specified in many ways. 
1-2 jobs, 3-5 jobs and >5 jobs cho- 


(all hosts are equally busy) and the action is a, 
then no utility is gained or 
This serves as a 


(no jobs are moved) 
lost. from performing action Ans 


sen in the above specification of 6 would typically 
be tuneable system parameters and in practice the 
three classes would probably vary more than shown 


These techniques are not the subject of this paper 
but they are based on several simple axioms (see [9].) 
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reference point and is assigned the value 50. 
Hence, values between 0-49 indicate losses of uti- 
lity, while values 51-100 indicate gains of util- 
ity. For states of nature 6,-6, (host i is least 


busy by 1-2 jobs) inclusive, the utility of moving 
a job is chosen to be 60; | 

u(6, 2a, ) = u(65sa5) = u(0,,a3) = U( 84 a4) = 60. 
This is less than the utility for states 86 - 8% 


(host (i-5) is least busy by 3-5 jobs) which is 
chosen to be 75 to indicate a non-linear increase 
in utility; u( 65a) = u(07,a5) = u(89,a4) = 

U(69 say) = 75. The utility for states 0147815 


(host (i-10) is least busy by >5 jobs) is chosen 
to be maximum and assigned the value 100; u(0,7.a,) 
= u(845.a5) = U(8, 42a) = u(61 4224) = 100. 


This again accounts for a non-linear increase of 
utility over the previous states. 


Note entries u(@,,a,) = 55; u( 849 2a) = 60; 
u(6),.a,) = 65 in Table 1. These utilities are 


lower than the utilities for other hosts in these 
same situations. This might be true, for example, 
because host 5 has limited memory or is a slow pro- 
cessor. In fact, many of the complex factors that 
Should be incorporated into the analysis of decen- 
tralized algorithms could be subsumed in the util- 
ity function. As an example, the u(81 4544) in 


Table 1 was chosen to be 100. This was based on 
the idea that if host 4 were least busy by more 
than 5 jobs then sending a job there is exactly 
what should be done. However, u(0142a,) could be 


tempered to account for factors such as the over- 
head of moving jobs, or the reliability of that 
particular host, or the probability that other 
hosts will also detect the same condition and send 
jobs to host i. The last factor implies that to 
maintain a stable system if a host becomes lightly 
loaded it should not be flooded with jobs. 


The scheduling algorithm in this example, like 
the original ARPANET routing does not protect 
against ping-ponging (cycles). Hence, the utility 
may be further lowered to account for the probabil- 
ity of a cycle developing. | 


All other entries in Table 1 represent losses 
of utility. These entries indicate that moving 
Jobs to the wrong host is counterproductive. For 
example, u(815a5). the utility of moving a job to 


host 2 if the real state is 8 corresponds to a 


loss of utility and is chosen to be 30. Other 
losses are chosen to be consistent with this loss. 


_ In summary, even though the development of the 
utility function is subjective, attempting to cre- 
ate such a function is a viable methodology that 
forces designers to consider the factors involved 
and their interrelationships. Furthermore, the sen- 
sitivity of the quantification of the utility func- 
tion for a given problem can be determined and then 
used in the design process. 
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Each controller maintains a table of state 
information. The values in this table estimate 
how busy each host of the .network is. This table 
is periodically updated in a manner analogous to 
the ARPANET routing tables [10]. In this way, 
each controller maintains its own view of the 
state of nature of the network. The controller's 
view at a particular instant of time is called an 
observation. More formally, the conditional pro- 
bability that host i observes Z4 when the true 


state of nature is 6, is written as P(z,|6,). In 


our example, the set of possible observations is 
Z = Zqo Zyo*+-Z1e6 where each Zp corresponds to 
0. for k = i except that the Zz, 'S are observations. 
The values P(z, 6.) assumed for this example are 


listed in table 2. In practice these probabilities 
can be determined by simulation or by measurement, 
and improved over time. For a "quasi-static" dis- 
tributed system it is possible to obtain these 
probabilities dynamically. Again, such a probabi- 
listic view of a system is, in fact, what a dis- 
tributed algorithm must work with. Hence, the 
model being developed closely resembles the real 
system in this regard. 


Step Five—Strategies © 


Next, the controller needs to formulate pure 
strategies which are decision rules that specify 


the action aj j = 0,...,5 that the controller 


takes in response to a particular observation Zp° 
k = 0,...,15. The set of possible pure strate- 
gies equals the number of actions raised to the 
power x where x iS equal to the number of observa- 
tions. In this example, the number of pure strat- 


egies equals g!6 Fortunately, by using Bayesian 
decision theory an efficient computation procedure 
can be used to determine the best strategy with- 
out calculating all possible pure strategies. 
However, the set of probabilities P(e.) i=1,2,...; 


15 is required. For this example, the assumed 
P(e.) is given in Table 3. Again, initially 


P(e.) would have to be estimated, but could be re- 
fined as measurements of the system were taken. 


Step Six—-Value of Strategies 


In the decision theory problem formulation, 
step six is to compute the action probabilities 
for each pure strategy. However,-Bayesian deci- 
sion theory does not require the calculation of 
all the pure strategies nor the action probabili- 
ties. In lieu of these calculations, the Bayesian 
computational procedure takes three inputs. In- 
puts include the probabilities P(Z|o.) i = 0,...; 


15 (Table 2), P(0,) i = 0,...,15 (Table 3) and 


the utility function u(@;,a;) i= 0,...15 and 


j = 0,...,5 (Table 1). Then by a series of sim- 
ple multiplications a maximizing action for each 


observed state of nature is produced. (See [9] 


for a description of the computational procedure. 


It is” also easy oT calcuTate the aupected 
value of each action andthe weighted expected 
utility of the Bayesian strategy. This. Tast num- 
erical quantity can be used as a rough comparison 
of different scheduling algorithms. Precise com- 
parisons are not meaningful unless the utility 
functions of each algorithm are consistent 
each other. 


Step Seven— Choice criterion | 


‘Simply, the Bayesian strategy maximizes the 
expected average utility. That is, in our example 
the Bayesian strategy is the set of actions that 
maximize } : P(e. )P(Z|e, ) ule, A). Other choice 
| criteria - possible but are not ‘discussed in 
this paper. This then completes the formulation 
and means for a solution of a decision problem. 


Utilization of Results 


The designer first formulates the problem and 
algorithm in Bayesian decision theory terms. This 
is a subjective but methodological approach. A 
sensitivity analysis is then performed on the util- 
ity function to either enhance the confidence in 
the quantification of utility or identify those 
entries in the utility table that are highly sen- 
sitive. Careful treatment of these entries is 
then required. The next step for the designer is 
completely objective and consists of a simple com- 
putational procedure. The result is a set of maxi- 
mizing actions. for each decision maker (Table 4). 
In general, the maximizing actions for each deci- 
sion maker may be different due to different util- 
ity functions and/or conditional probabilities. 


The initial maximizing actions for each deci- 
sion maker are then stored locally as part of the 
scheduling algorithm. When a job scheduling deci- 
sion is to be made, a simple table look up is per- 
formed based on the current observation about the 
state of the network. This table look up identi- 
fies the proper decision for this scheduler. An 
important part of this process is the fact that 
this approach requires a low execution time over- 
head for the scheduler... Then special monitor nodes 
of the network act to dynamically adjust the pro- 
bability distributions and maximizing actions by 
gathering statistics, recalculating maximizing ac- 
tions, and downline loading these maximizing actions 
to the scheduling entities. This should be a rea- 
sonable heuristic if the system is quasi- “static. 
Simulation Studies are planned. 


Bayesian Decision Theory Calculations 


A PASCAL program was written for the Bayesian 
decision theory computational procedure and maxi- 
mizing actions for the scheduling problem described 
above were calculated. The result is shown in 
Table 4 and is labelled Run 1.. The input probabi- 
lities P(9;) for this run are reproduced in Table 
4 for convenience. The results include the action 
aj for each observation Z that maximizes expected 
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were altered. 


timal then this is a non-obvious result. 


processor or be extremely reliable. 


average utility. Additional sotutions were cal- 
culated for different probability functions P(e. ). 


These results are Ttabeled Run 2, 3 and 4 and ar 


presented in:Tables 5, 6, 7 respectively. Using 
the original probability function P(6;) of Run 1, 
Runs 5 and 6 were made by varying the Utility | 
Function. In Run 5 only one row of the utility 
table (table 1). was changed. In Run 6, two rows 
These results are reported in 
Tables 8 and 9 respectively. y 


The primary ‘intent of these calculations. is 
to show how non-obvious maximizing actions result 
even for this’ simple example. By obvious results 
is meant that if a host 1 is observed as least 


busy then the proper action is to send jobs to it. 
If however, the probabilities and utilities inter- 


act in a manner as to make some other action Op- 
"eS If more 
complicating factors are added into the problem, 
all of the maximizing actions may be non-obvious. 
The maximizing actions in Tables 4-9 inclusive 
that are non-obvious are marked with an asterisk. 


Several representative non-obvious results from 


Tables 4- 9 are now discussed. 
For example, in Run 2 regardless of the ob- 
servation Zq - - 2 inclusive jobs are sent to host 


1 to maximize expected utility. This bias occurs 


primarily because of the high probability that 


the true state of nature is 6], and the low utility 


gain of moving jobs when there exists only a dif- 
ference of 1-2 Jobs . 


The most interes Cine observation about Run 3 


is that for observation 219 (host 5 least busy by 


(move no 
This result is a cas eign of the high 


3-5 jobs) the maximizing action is aq 
jobs) . 


probability (. 3) of 6) (all hosts equally busy) 


being the true state of nature and the low utility 
for host 5 because it is the slowest processor in 
our example. Yet, when the observation is 215 


(host 5 least busy by >5 jobs) the maximizing 


action is a5 (move jobs to host 5). This implies 


that the utility of the movement of jobs dominates 


the low probability of this state existing. 


In Run 5 the u(0, a5) | is increased from 30 | 


to 50 over Run 1. This may occur for many differ- 
ent reasons, e.g., host 2 might be a very fast 
This increase 
is enough to cause maximizing actions for observa- 
tions Zo and z, to be ap. Yet, the u(6,.a3) in- 


creasing from 30-40 and u(e,,a,) decreasing from 


30-20 are not enough of a utility function change 
to result in any non-obvious maximizing actions. 


Similar justifications (post analysis) can be 
made for all the starred entries in Tables 4-9. 
However, before one does the Bayesian decision 
theory computations it is not easy to come up with 
the same results. 


Conclusions 


There is a definite need for an effective 
technique to develop, analyze and compare decen- 
tralized algorithms for "cooperative" distributed 
systems. Such systems are quite complex with many 
interacting forces, operate in a "noisy" environ- 
ment, are inherently probabilistic, and often 
operate under strict time requirements. Most 
analysis techniques do not treat problems of this 
nature. In this paper it was shown how to apply 
Bayesian decision theory as a methodology to deal 
with this problem. 


However, there are three main issues that 
must be resolved before this technique proves com- 
pletely useable in practice. The first is concern- 
ed with the ability to develop an effective util- 
ity function. Our hypothesis was that even though 
the utility function was subjective it has the 
potential for subsuming some of the complicated 
interacting forces in an accurate way. It is a 
systematic method for dealing with complex inter- 
actions and uncertainty. A sensitivity analysis 
can be performed to enhance one's confidence in 
the utility quantification. We claimed that at a 
minimum this approach was more methodological and 
worthwhile than current techniques. 


The second important issue concerns the dynam- 
ics of a distributed system. It is simple toclaim 
that new sets of maximizing actions can be calcul- 
ated when the statistics of the network change. 
Yet, will the number of significant changes in the 
statistics be small enough to be feasible? How. 
often will a switch to a new set of maximizing 
actions be needed? Will the new period be long 
enough to make the switch worthwhile? Answers to 
these questions seem possible for what might be 


called "quasi-static" distributed processing systems. 


The third issue is whether the two apriori 
probability functions P(e.) and P(0. |Z) can be 


known in practice. Initially, a pact guess is made, 
or a standard statistical model is assumed. Then 
these probabilities can be revised in terms of the 
measured activities as experience with the system 
grows. It is also possible to maximize expected 
utility given only P(e) and u(6.,a,). This is 


sometimes called the no data problem because the 
decision is made without making a current observa- 
tion. Therefore, if the decision maker does not 
take samples (observations) of the current state 
of the network into account then there is no néed 
for the P(o; |Z) information. This technique might 
be appropriate in some systems for some algorithms. 
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(extended abstract) 


ABSTRACT 


In this paper we implement several basic 
operating system primitives by using a 
"replace-add" operation, which can supersede the 
standard "test and set", and which appears to be 
a universal primitive for efficiently 
coordinating large numbers of independently 
acting sequential processors. We also present a 
hardware implementation of replace-add_ that 
permits multiple replace-adds to be _ processed 
nearly as efficiently as loads and _ stores. 
Moreover, the crucial special case of concurrent 
replace-adds updating the same variable is 
handled particularly well: If every PE 
simultaneously addresses a replace-add at the 
same variable, all these requests are satisfied 
in the time required to process just one request. 


1.0 INTRODUCTION 


Very large scale parallel processing, made 
possible by the refinement of VLSI technology, is 
becoming a reality. Although current MIMD 
(multiple instruction streams - multiple data 
streams) configurations rarely include more than 
a few dozen processing elements (PEs), much 
larger configurations are being designed 
(Burroughs [3], CHoPP (see Sullivan et al. [23]), 
etce) and configurations involving tens of 
thousands of PEs will soon be feasible. 


Since in such configurations the relative 
cost of serial bottlenecks rises linearly with 
the number of PEs present, users of these future 
ultra-large scale parallel machines will be 
anxious to avoid the use of critical (and hence 
necessarily serial) code sections, even if these 
sections are short enough to be entirely 
acceptable in current practice. 


In this report we implement several basic 
operating System primitives by using a 
"“replace-add" operation, which can supersede the 
standard "test and set", and which appears to be 
a universal primitive for efficiently 
coordinating large numbers of independently 
acting sequential processors. We also present a 
hardware implementation of replace-add_ that 
permits multiple replace-adds to be processed 
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nearly as efficiently as loads and _ stores. 
Moreover, the crucial special case of concurrent 
replace-adds updating the same variable is 
handled particularly well: If every PE 
simultaneously addresses a replace-add at the 
same variable, all these requests are satisfied 
in the time required to process just one request. 


Critical sections, used to enforce mutual 
exclusion when multiprocessing a single PE, were 
introduced by Dijkstra [5] and later refined by 
Knuth [16] and Eisenberg and McGuire [10]. 
Later, Dijkstra [7] and Lamport [17] studied 
similar issues for parallel processing. Although 
this report also considers similar issues, we 
assume a somewhat different computational model. 
Various multiprocessor synchronization 
primitives, including those used below, have been 
compared by Lipton [19], Burns et al. [2], 
Henderson and Zalcstein [14], and Dolev [8]. 


This report is organized as follows. First, 
our "paracomputer" model of computation is 
explained and the replace-add operation is 
defined (section 2). We then use the replace-add 
operation to implement semaphores (section 3) and 
to solve the readers/writers problem without 
recourse to critical section code (section 5). A 
distributed queue management technique that also 
avoids the use of critical sections is derived 
and then enhanced to form the core of a 
distributed operating system scheduler (section 
6). Finally, the replace-add hardware design is 
outlined (section 9). . 


2-0 COMPUTATIONAL MODEL 


The replace~add operation, on which many of 
our considerations are based, was introduced in 


the 1967 studies of the Athene’ hypothetical 


parallel computer system (Draughon et al. [9]). 


Before describing this operation, a generalized 


test and set that appears to be an attractive 
primitive for coordinating concurrent processes, 
we first discuss our model of parallel 
computation. 


2-1 The Machine - An ideal parallel processor, 
dubbed a "paracomputer" by Schwartz [21], 
consists of identical PEs sharing a common 
memory « The individual PEs may also have 
attached local memory, which we refer to as their 
"private" memories; the memory shared by and 


common to all processors is called "public", and 
variables stored there are called 
variables". The PEs can simultaneously read any 
public cell in one cycle. Moreover, simultaneous 
writes (including ‘the ‘replace-add operation 
described below) :are likewise effected in a 
single cycle and a ‘memory .cell ‘to which ‘such 
writes are directed will contain some one of the 
quantities written into it. This requirement on 
simultaneous memory -updates illustrates the 
(paracomputer) serialization principle: The 
effect of simultaneous actions by the PEs is as 
if.the actions occurred “in some (unspecified) 
serial order. Note ‘that simultaneous . memory 
updates are not serialized; in fact they are 
accomplished in one _ cycle. The serialization 
principle speaks only of the effect of their 
action and not of. their implementation. 
(Paracomputers must be regarded as idealized 
computational models since physical fan~in 
limitations prevent their ‘Yealization.) 


Our (realizable) approximation to. a 
paracomputer is an MIMD parallel processor in 
which each PE can directly access its private 


memory and can access’ the public memory via a 
(multicycle). interconnection network. Since in 


this more realistic architecture a public memory 
access may require many PE cycles, we must 
carefully define the notion of simultaneous: Two 
actions rl and r2 are simultaneous if rl starts 
before r2 finishes and r2 starts before rl 
finishes. 

2-2 Replace-Add - The format of the replace-add 


operation, which forms the basis of much of our 
subsequent discussion, is RepAdd(V,e), where V is 
an integer variable and e is an integer 
expression. This indivisible operation yields 
the sum S=V+e as its value and replaces the 
contents of storage location V by this sum. 
Moreover, RepAdd must satisfy the serialization 
principle: Assume that Vis a public variable 
(as it ordinarily will be) and many (perhaps very 
many ) replace-add operations simultaneously 
address V. Then the effect is as if these 
operations occurred in some (unspecified) serial 
order, i.e». WV receives the appropriate total 
increment and each operation yields. the 
intermediate value of V corresponding to its 
position in this order*. The following example 
illustrates the semantics of replace-add: If V 
is a public variable, if PEi executes 

ANSi <=— RepAdd(V,ei) , 
if PEj simultaneously executes 

ANSj <-- RepAdd(V,ej)  , 
and if V is not erate updated by another 
PEk, then either 
ANSi <-- Vt+ei 
ANS] <-- V+eitej 
or : 


* These intermediate values result from executing 
prefixes of the serialized list of operations. 


“public. 


- the serialized order in which first PEi 
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ANSi <-- V+eitej 

ANS} <-- Vtej 

case, the value of V_ becomes 
first possibility corresponds to 
executes 
executes its 

corresponds 
Suppose, to be 


and, in either 
Vteitej. The 


its replace-add and then PEj 
replace~add; the second possibility 
to the opposite serialization. 
still more specific, that V initially contained 
the value 10, and that ei=2 and ej=6. Then, 
after the simultaneous executions, V will contain 
18 and either ANSi=12 and ANSj=18 or ANSi=18 and 
ANS j=16. 


In section 9 we present a hardware design in 
which the replace~add operation requires 
essentially the same execution time as a load or 
store and in which simultaneous replace-adds 
updating the same variable are _ processed 
particularly effeciently. 7 


3.0 SEMAPHORES 


Having reviewed the basic replace-add 


operation, we proceed to describe its role in 
implementing a variety of higher-level 
programming operations. We first present a 


replace-add based implementation of Dijkstra’s 
{5} P(S) and V(S)} operations (thus illustrating © 
that replace-add obviates one important need for 


test-and-set), and then generalize this 
implementation to PVchunk operations PC(S,e) 
(resp. VC(S;,;e)) where S is incremented (resp. 
decremented) by e (see {25]). Subsequent 
Sections show that our implementation of PC and 
vC permits more parallelism than traditional 
implementations. _ 


Recall that the P and V operations are used 
to protect critical code sections by enforcing 
the following “PV-property": If many processors 
concurrently execute* 


Procedure PVTest 
Comment: Initially § S=l. 
Cycle { P(S) 
critical seetion | 
V(S) } | . 
End Procedure 


if the critical section does not modify $, and if 
no PE ceases execution, then at any time T at 
most one processor is executing its critical 
section and there exists a time t > T when 
exactly one processor is executing a critical 
section. 


3 o 1 
a 


Implenienting PV - In this section we present 
PV implementation that satisfies the 


* We use "{" and "}" for the tokens "Begin" and 
"End" respectively. However, our indentation 
convention obviates the need for these tokens. 


.PV-property given above (see [12] for a proof of 
this claim). The P(S) operation first waits 
until the public variable S$ equals 1 and then 
executes ReéepAdd(S,-1). If the result is zero, 
the critical section may be _ entered. If the 
result is negative, some other processor has 
control of the section and so P(S) "covers its 
tracks" and then tries again. The V(S) 
implementation consists simply of a replace-add 
incrementing S by 1. The following code is an 
appropriate implementation of these important 
primitives. (As will be explained below, various 
subtleties are involved.) | 


Procedure P(S) 
Repeat 
If S-1 > 0 Then | 
If RepAdd(S,-1) > 0 Then OK <-- true 
Else { RepAdd(S,1) 
“OK <-- False } 
Until OK | 
End Procedure 


Procedure V(S) 
RepAdd(S, 1) 
End Procedure 


To emphasize a subtle point inherent in our 
implementation of P, consider the following very 
similar, but actually incorrect, implementation. 


Comment: Incorrect implementation of P 
Procedure NaiveP(S) 

Repeat 

COE RepAdd(s, -1) > 0 Then OK <-- true 
Else {. RepAdd(S, 1) 
} OK <-- false } 

Until OK 

End Procedure 


If one compares this simplified form with 
the correct original shown earlier, it may appear 
that we have merely removed a "redundant" test. 
However, the simplified code can in fact fail due 
to unacceptable race conditions. Suppose, for 
example, three PEs, A, B, and C, execute P(S) at 
the same time with S having its initial value of 
le If the serial order effected is equivalent to 
A executes first followed by B and C, then S is 
set to -2 and A enters the critical section. 
Suppose that A subsequently leaves the critical 
section, thus incrementing S to -l1.. The section 
should now be free to be entered by either B or 
C. The above code will allow this to occur as 
soon as S is incremented to +1 from its current 
value of -1. However, this may never happen, 
since the following endless scenario is possible: 
B increments $ to 0 and then decrements S$ back to 
-1 before C executes its next instruction; thus 
B fails to enter the critical section. Then, 
while B is between instructions, C increments and 
immediately decrements. S. 8B and C continue in 


this fashion indefinitely causing S to vary 
between 0 and ~1, never reaching +1. Since every 
decrement occurs when S=0, the critical section 


is never entered and thus the PV-property is not 
satisfied. 
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public variable S is not restricted to +1. 


geen above 


Note that this race condition, unlikely when 
just three processors are involved, becomes 
steadily more probable as we increase the number 
of processors trapped in the semaphore. 


3.2 Implementing PVchunk - In order to solve the 
readers/writers and other synchronization 


problems, {it is convenient to define PVchunk 
operations where the increment e applied to the 
We 
write these operations as PC and VC and implement 
them using the same test-modify-retest paradigm 
for P and V. The following code 
assumes that 5S has been initialized to some 
positive integer. 


Procedure PC(S,e) 
Repeat. 
If S-e > 0 Then 
If RepAdd(Ss, -e) > 0 Then OK <-- true 
Else { RepAdd(S,e) 
3 OK <-- false } 
Until OK 
End Procedure 


_ Procedure VC(S,e) 


RepAdd(S,e) 
End Procedure 


3.3 Remark - It is worth noting that Dijkstra 
[6] considered the replace-add operation and 
examined the NaiveP procedure considered. above, 


noting essentially the same race condition that 
we have discussed. 


Dijkstra concluded that the 
replace-add was a less appropriate coordination 
primitive than a simpler "swap" instruction. 
However, this conclusion becomes progressively 
less acceptable as the number of PEs grows larger 
since the swap instruction leads to serial 
bottlenecks. = 


ee ee a, 


4.0 THE TEST-MODIFY-RETEST PARADIGM 


The previous geceton showed the need to test 
a semaphore before a decrement-and-test operation 
is applied. Since such test-decrement~retest 
(and corresponding test-increment-retest) 
sequences occur often, we define two procedures, 
each embodying one of these two basic sequences, 
which are used Enneueneue the remainder of this 
report. 


Boolean Procedure TDR(S,Delta) 
If S-Delta > 0 Then 
If RepAdd(S,-Delta) > 0 Then 
TDR <~- True 
Else { RepAdd(S,Delta) 
TDR <-- false } 
End Procedure -_ 


Boolean Procedure TIR(S, Delta, Bound) | 
If StDelta < Bound Then _.. ee 
If RepAdd(S, Delta): < Bound Then 
TIR. <-- true... 
Else { RepAdd(S, Delta). 
TIR <-- false }_ 
“End Procedure 


Using TDR the PC procedure of section 3.2 
can be expEcesce as ‘Simply: 
Procedure PC(S,e) 


? Repeat Until TDR(S, e) 
End Procedure “a 


5-0 READERS AND WRITERS 


In preparation for the more complex problems 


to be considered below, we now use the PC and VC 
operations to solve the well known 
readers-writers problem, in which a group of 


"reader" processes and "writer" processes are to 
share the use of a resourcee Many readers may 
use the resource simultaneously, but all. other 
processes become blocked as 
writer is active. 

The basic idea behind the following ‘simple 
solution is to maintain a counter equal to 
n(l-w)-r, where n is (no less than) the maximum 
possible number of active readers in the system, 
and r and w equal the number of active readers 
and writers respectively. | ioe 


Procedure Reader. 
PC(S,1) 
read-body 
vc(s,1) 

End Brocecure- 


eecceduee Writer z 


-PC(S,n) 

write-body 

VC(S,n) 
End Procedure 

Note that, in the absence of writers, no 

serial. code is executed by the above 
implementation. In contrast, standard "test and 
set" based implementations use (very small) 


critical sections to protect the adjustment of 
their counters. Although the . simple solution 


given above allows a continuing stream of readers 


to lockout all writers and vice-versa, solutions 
avoiding these potential lockouts are given in 
U2] and [20]. | — 
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soon as a single 


operations, 


‘respectively. 


#Qu, 
active insertions 


QueueUnderflow, 


6.0 MANAGEMENT OF HIGHLY PARALLEL QUEUES 


Although at first Gianke: . the important 


problem of queue management may appear to require 


use of at least a few inherently. _ serial 
we show in this section that a queue 
can be shared among processors without | using any 
code that might create serial bottlenecks. The 
procedures to be shown next maintain the basic 


first-in first-out | ‘property of a queue, whose 


proper formulation in the assumed ‘environment _ of 
‘large numbers 


of simultaneous insertions. ‘and 
deletions is as follows: If insertion of a data 
item p is completed before insertion of another 
data item q is. started, then it must not be 
possible for a deletion yielding q to complete 
before a deletion yielding p has started. 


Since queues are the central data structure 
for many algorithms, a concurrent queue access 
method can be an important tool for constructing 
parallel programs. When analyzing one of their 
parallel shortest path algorithms, Deo et al. [4] 
dramatize the need for this tool. 


"However, regardless of -the number of 

processors used, we expect that algorithm 
PPDM has a constant upper bound on its 

speedup, because every processor demands - 
pEavete use of the Q." 


6.1 The Algorithm - In the algorithm. below we 


represent. a queue of length Size by a public 
circular array Q(0:Size-1] with public variables 
I and D pointing to the locations of the items 
last inserted and deleted (these correspond to 
the rear and front of the queue respectively). 
Thus MOD(I+1,Size) and MOD(D+1,Size) yield the 
locations for the next insertion and deletion, 
Initially I=D=0 fl to 
an empty queue) 


We maintain two additional counters, #Q1 and 

which give lower and upper bounds 
respectively on the number of items in the queue 
and which never differ by more than the number of 
and deletions. Initially 

indicating no activity and an empty 
The parameters QueueOver flow and 
appearing in the code shown 
below, are flags denoting the exceptional 
conditions that occur when a processor attempts 
to insert into a full queue or delete from an 
empty queue- The actions appropriate for the 
QueueOverflow and QueueUnder flow ‘conditions are 
application dependent: One possibility is simply 
to retry an offending insert or delete; another 


#Q1=#Qu=0, 


queue. 


‘possibility is to proceed to some other task. 


| Code for eritical-section-free 


implementation of Insert and Delete is given 
below. 
First a TIR is used to guarantee the existence of 


The insert operation proceeds as follows: 


space for the insertion, and to increment the 
upper bound #Que If the MTIR fails, a 


QueueOverflow occurs. If it succeeds, the 
expression Mod(RepAdd(I,1),Size) gives the 
appropriate location for the -insertion while 
simultaneously updating the insert pointer, and 
the insert procedure waits its turn to overwrite 
this cell (this point is discussed below). 
Finally, the lower bound #Q1 is incremented. The 
delete operation is performed in a symmetrical 
fashion; the deletion of data can be viewed as 
the insertion of vacant space. 


Procedure Insert (Data, Q,QueueOver flow) 
If TIR(#Qu,1,Size) Then { 
MyI <-- Mod(RepAdd(I,1),Size) 
Wait turn at Myl 
Q(MyI] <-- Data 
RepAdd(#Q1, 1). 
QueueOver flow <-- False } 
Else QueueOverflow <-- True 
End Procedure 


peacedune Delete(Data,Q, QueueUnder flow) 
If TDR(#Q1,1) Then { 
MyD <-- Mod(RepAdd(D,1),Size) 
Wait turn at MyD 
Data <-- Q[MyD] 
RepAdd(#Qu,-1) 
QueueUnderflow <-- False } 
Else QueueUnder flow <=-- True 
End Procedure 


6-2 Cell Contention - Since we assume that PEs 
can execute at widely differing rates (due, for 
example, to memory contention, see section 12), 
it is possible that many active insert and delete 
operations can have been assigned the same queue 
cell location Le When the queue is nearly full 
or nearly empty, conflicts involving one insert 
and one delete are reasonably likely (but a 
simple "cell-vacant" flag would be sufficient to 
resolve them). However, the circular. array 
structure allows the (unlikely) possiblity that 
many active insert and delete operations all 
attempt to address the same cell simultaneously. 
We prevent this anomaly by associating semaphores 
with each cell (see [12] for details). 


6.3 Avoiding Integer Overflows - Care. is 
required to avoid potential overflows of the I 
and D counters caused by the combination of small 
word. size, large numbers of processors, and high 
queue insertion ratee Since we need. only 
maintain the values of I and D modulo the queue 
size, we may bound the e228 of I by inserting the 
statement: 

If I > MaxInt-#PE Then RepAdd(I, -Size) , 
where MaxInt is the largest representable 
integer, immediately before the statement that 
increments I. ‘Since many, even all, processors 
may execute this statement simultaneously, we 
require that MaxInt~-#PE-#PE*Size > MinInt . 
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_ § [RepAdd(Top,-1)+1] 


7.0. OTHER PARALLEL DATA STRUCTURES 


Having discussed queues, we now briefly 
indicate how the replace-add operation can be 
used to provide highly concurrent access to other 


important data structures. A more detailed 
presentation is found in [12]. 

7-1 Task Queues - In order’ to define a 
queue-like data structure appropriate for the 
scheduler component of a highly parallel 


operating system, the queue mechanism described 
above should be enhanced to permit insertions of 
items tagged with priorities and multiplicities. 
In such a queue, each item i has an associated 
multiplicity mi indicating the number of times i 
is to be deleted before it is actually removed 
from the queue, i-e. the pair (i,mi) represents 
mi consecutive entries of item i in a much longer 
(hypothetical) queue. The parallel operating 
system we envision needs to support the following 
two primitives: 

1. RequestPE(N,P,CodeBlock) - whereby a 
request is made for N processes. to 
execute a block of code at priority P. 


2- ReleasePE - whereby the PE invoking this 
primitive announces that it has 
completed its assigned task and is 
available for reassignment. 


The scheduler responds to the first primitive by 
inserting CodeBlock onto the task queue with 
priority P and multiplicity N.- To implement’ the 
second primitive the scheduler deletes an entry 
from the task queue and transfers eontror to the 
corresponding CodeBlock. 


7.2 Stacks - A stack of length Size is 
implemented as a public array S[0:Size-1l] with a 
public variable Top indicating the current top of 
the stack. A push operation addresses stack 
location S{RepAdd(Top,1)] thereby incrementing 
Top and a pop operation addresses stack location 
thereby decrement ing Top. 

pushes and pops can all 
stack location, a (small) 
parallel access queue into which pushes insert 
items and from which pops delete items is 
associated with each stack location. 


Since simultaneous 
address the same 


7.3 Avail Lists - Parallel access to the _ free 
Space (avail) list used by the linked allocation 
scheme described in Knuth [16] is acheived by 
maintaining a queue of pointers to free blocks. 
Acquiring (resp. returning) blocks is 
accomplished by deleting from (resp. inserting 
into) this queue of pointers. 


8. O° DETECTING COMPLETION OF PARALLEL ACTIVITY 

Since the cessation of activity involving a 
shared resource often indicates completion of a 
given task, it is important to be able to detect 
this event. In this report we ‘consider a typical 
example, namely detecting the situation in which 
a shared queue is and will remain empty, i.e. 
when all the PEs are trying to delete from an 
empty queue. This is the natural termination 
condition for applications in which multiple PEs, 
each acting as both a producer and as a consumer, 
use a global queue to buffer data items which 
they peer none themselves. 


If the seebize of deceering iouplaeaon is 
temporarily ignored, the aot Towing code typifies 
such “AppLecae fone: | 


Geile oe 
- If producer cycle Then t 
_ produce data 
Repeat Insert (Data,Q,Over flow) 
Until Not Overflow } 
Else { Comment: consumer cycle. 

Repeat Delete(Data,Q, Under flow) 
Until Not Underflow | 
consume data }} 


However, the queue Under flow condition | generated 
by Delete is’ not sufficient to signify task 
completion since inserts may still occur after 
the Underflow condition has occurred. Thus, to 
detect a state in which all PEs are trying to 
delete from an empty queue (this state is denoted 


T), we must modify the code shown above, which we | 
When a season ste occurs, - 


do as _ follows. 
instead of retrying the delete, 
counter W which is then cseaced with #PE. 
they are equal, state T has occurred. 
the PE loops until either 
nonempty, in which case W is decremented and the 
deletion is retried; or until W equals #PE and 
state T has occured. The detailed code follows: 


increment a 
If 


Comment : Initially W = 0. 
Cycle es: oe 
” If producer iets Then { 
produce data | 
. Repeat Insert(Q, Data. » Over flow) 
‘Until Not Overflow }  — 
Else { Comment: Consumer cycle. 
Repeat Delete(Data,Q, Bnsertiow) 
If Underflow Then’ { 
RepAdd(W,1) - oe 
Repeat Until W = #PE Or mae >0 
_ If W=#PE Then Comment: state T. 
Else RepAdd(W,-1) -}. 
Until Not Underflow | 
‘consume. data i} Se 


If not, . 
the queue becomes — 


9.0 HARDWARE IMPLEMENTATION | 


In’ hie: section we show . to P pete eeten an 


" omega-network enhanced to enable the network to 


process: multiple replace-add — operations in a 
highly parallel manner. 
description of Omega-networks ed SrEre I for an 
tp puStEae tons | 7 


| We suppose that. P = 2%*D PEs are to 
communicate with a. like number of memory modules | 
(MMs) and define a memory ‘cycle to be’ the time 


‘required for a single PE, in the absence of any 
- other 


communication traffic, to transmit a 
request to an MM and then receive a response. . 
This cycle time equals the MM access time. plus— 
twice the network transmission time. a 


9] Implementing Loads And Stores - The well 
known manner in. which an . omega-network can be 


used to implement memory loads and stores relies 


"ei1L" 


_ the simpler Burroughs approach. 
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transmitted 


appropriate 


on the existence of a. (unique) path connecting 
each PE-MM pair. Requests from PEs to MMs are 
transmitted along these paths and responses are 
along the reverse paths. 


Unfortunately, however, two concurrent requests 


conflict whenever the corresponding paths. are not 


edge disjoint. 


One way to resolve these conflicts is to 
one of the two requests and have it 
resubmitted by the PE. Despite the primitive 
nature of this scheme, proposed by Burroughs 


Corporation for their FMP [3], it exhibits good 
average case behaviour (see. [13] and [26]). 
Alternatively, we may resolve conflicts by 


enqueuing one of the two. conflicting requests in 
the first switch at which they conflict (see 
[1l1]}). In the sequel we do not use this queuing 
technique: For expository purposes, we prefer 


It is worth noting that some conflicts are 
"favorable": When concurrent loads and stores 
are directed at the same memory location and. meet 
at a switch, a scheme described below shows how 
they can be combined (and thereby satisfied) 
without. introducing any delays Moreover, by 
determining the most favorable serial order for 
these simultaneous requests, an enhanced switch 
can ‘combine them efficiently... The actions 
0 . for each favorable conflict are as. 
follows (some of these optimizations appear - ‘in 
the PEROEE design, see [24] and EPeh 2? 


Transmit one of the. two 


1. Load-Load: 
(identical) loads and return to each the — 
value obtained ftom memory 
| Qe baaa-store?”? ieiawales the ‘store. and — 
“-. yeturn its value to peers the load. 
36 Store-Store: rvadaaie: either : store ‘and 


ignore the other. 


See [1} and [18] for ax. 


Favorable conflicts - reduce communication 
traffic: and thereby increase the percentage of 
satisfied requests. Since combined requests can 
themselves be combined, any number of, concurrent 
memory references to the same location can all be 
satisfied in one memory cycle (assuming the 
absence of conflicts with requests destined for 
other memory locations). 


9.2 Implementing Replace-Add - The replace-add 
operation can be realized by augmenting the MMs 


with adders and connecting then, via an 
omega-network, to the PEs: When a RepAdd(X,e) 
operation is transmitted through the network to 
the MM containing X, the value of X and the 
transmitted e are brought to the MM adder, and 
the sum is both stored in X and returned through 
the network to the requesting PE. 


Since we expect that concurrent replace-add 
operations will frequently reference the same 
memory location, efficient performance in the 
case of favorable conflicts is very important. 
Fortunately, by including memory and an adder in 
each switch, the . network can achieve for 
replace-adds the excellent performance described 
above for loads and stores. (Note that, although 
we will continue to use the term "switch" for the 
devices located at the nodes of -the enhanced 
omega-network, these devices are _ 
closer to microprocessors than to simple switches 
and thus introduce non-trivial delays.) | 


When two replace-adds referencing the same 
public variable, say RepAdd(X,e).and RepAdd(X, f), 
conflict at a switch, we effect the serialization 
order "RepAdd(X,e) immediately followed by 
RepAdd(X,f)". This is done as_ follows: The 
switch forms the sum e+f, transmits the combined 
request RepAdd(X,e+f), and stores the value f in 
its. .lecal memory (see figure 2). When the value 
Y.is returned to the switch (in response to 
RepAdd(X,e+f£)), the switch returns Y to satisfy 
the original request RepAdd(X,f) and returns Y-f 
to satisfy the original request RepAdd(X,e). 
there was no other conflict, Y = Xte+f; thus the 
values returned are X+te and X+te+f and the memory 
location X receives this Y value Xtet+f. If other 
RepAdd(X,g) are simultaneously processed, the 
combined requests are themselves combined and the 
associativity of addition guarantees that the 
procedure gives a result consistent with the 
serialization principle. 


-In-summary, the switches process favorable 
replace-add conflicts as follows: 


RepAdd-RepAdd. As described above, a 


le 
combined request is transmitted and the 
result used to . satisfy both 
~ replace-adds. 
2. RepAdd-Load. . Treat Load (X) as 


RepAdd(X,0). 


fuctionally . 


TE . 


-implementation avoids 


3. RepAdd(X,e)-Store(X, f). Transmit a 
store of e+f and satisfy the replace-add 
by returning e+f. 


The above scheme reduces communications traffic 
and exhibits good average case performance. A 
detailed analysis and hardware design will appear 
in [20]. | 


10.0 SUMMARY 


Since the relative cost of serial 
bottlenecks rises linearly with the number of PEs 
present, elimination of such bottlenecks will 
become steadily more important in future parallel. 


processors. By exhibiting replace-add based 
bottleneck-free implementations for several 
important operating system primitives, and by 


presenting an efficient hardware realization of 


the replace-add operation, we hope to have’ shown 
that this operation is an _ appropriate 
synchronization tool for ultra-large scale 


parallel machines. We note that our replace~add 
the hardware bottleneck 
usually associated with concurrent access to a 


single memory location. 


Recall that in the absence of writers, no 
serial code is executed by our readers-writers 
implementation and that completely parallel 
behavior is also exhibited by our queue access 
method (unless the queue in question is full or 
empty). In contrast, standard "test and set" 
based solutions to the readers-writers and queue 
management problems use (very small) critical 
sections to protect the adjustment of their 
counters. We note for example that paracomputer 


‘simulations indicated a serial bottleneck in our 
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parallel codes for radiation transport until we 
replaced standard queue access methods with the 
ones given above. : 


Since we expect that on chip delay times 
will typically be less than chip to chip 
transmission times, the network overhead imposed 
by supporting the replace-add operation should 
not degrade network |. transmission time 
Significantly. Therfore, we believe that future 
parallel processors, utilizing something close to 
the hardware design presented above, can realize 
the replace-add in very little more than the 
execution time required for a public memory 


reference. We note that the "ultracomputer" 
group at N.Y.U. is developing a preliminary 
design for a prototype machine and operating 


system incorporating the ideas presented above. 
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Parallel Scheduling Algorithms*. 
Eliezer Dekel and Sartaj Sahni 
University of Minnesota _ 


1 Introduction 


~With the continuing Avawatic decline. in 
the cost of hardware, it is -becoming 
feasible to economically build computers 
with thousands of processors. In fact, 
Batcher [1] describes a ‘computer . (MPP) 


with 16,384 processors that is currently 
being built for NASA. In coming years, one 
can expect to see computers with a hundred 
thousand or even a million processing ele-— 
ments. 
study of parallel algorithms. 

Since the complexity . of 


a parallel 


This expectation has motivated the 


algorithm. depends very much on the archi-_ 


tecture of the parallel computer on which 
it is run, it 
architecture in mind ‘when designing the 
algorithm. 
have been proposed | and studied. 
paper, we 
the single instruction stream, multiple 
data stream (SIMD) model. Our techniques 
and algorithms readily adapt to the other 
models (eg: multiple instruction stream 
multiple data stream (MIMD) and data flow 
models). References.) to several papers 
dealing with algorithms for SIMD machines 
can be found in [2]. 


When measuring the effectiveness ofa 


In 


Several parallel architectures. 
this . 
-Shall deal directly only with 


is necessary to keep the. 


and d. 


is Q(n/(logn*n/logn) )=Q(1). 


x tas, Vet * £2 AL 
case. 
3 x=8 : schedule job i on machine 
[a,/£1 from ® to p, 
3X? D, :~ schedule job i on machine 
fa. /£1 from f-x to £-x+tp. 
:e@lse: schedule job i-on machine 
y cael from @ to p,;-x 
end case | 
If we nave: n PEs, ‘all the machine 
assignments can be computed in O0(1) time. 
However, using only n/logn PEs, these. 
assignments may be obtained in O(logn) 
time (i-.e., each PE computes at most 
[legn] assignments). So, the overall 
scheduling algorithm has a complexity of 
O(logn) and uses n/logn PEs. So, its EPU 


3-. Number of Tar dy Jobs: 

set of arn 
is the processing time of job i 
‘is its due time. Let S be any one 
machine schedule for J. Job i is tardy in 
the schedule S iff it completes after its 


Let - J=((p,. a.) li<ien} aerine a 
jobs. Py. 


‘ due time qs 


parallel algorithm, one needs to consider 
both its complexity as well as its cost in 
terms of the number of PEs used. The 
effectiveness of processor utilization 
{EPU). is the complexity of the best. 
sequential algorithm for P divided by 
one. of PEs used by A * complexity of 
2. Minimum Finish Time 


When preemptions are permitted, minimum 
finish time schedule for 
efficiently obtained using Mc Naughton‘s 
rule. Let p,,p r+e+,P, be the processing 
times of the n jobs. The finish time, 


a 


m machines is 


f, 


of an optimal preemptive schedule is given. 


by: 


f = max{ max (edit 
1<ic<n 


= pi) 


i=l 


Using f, the optimal schedule may be 
constructed in O(n) time . 


.gidering the jobs in 


Using the parallel algorithms of [3], 


computed in 
To 
also 


max{p; } and 2D; “may be 


6 (eas) time with ee PEs. 
the actual schedule, we 


l<i<n.. All the A,s can be 


in O(logn) time using n/logn PEs 
Let A,=@. Each job i can now deter- 
mine its Own processing 
using the following rule: 
* This research was supported in part by 
the Office of Naval Research under con- 
tract N@AG14-88-C-965G. 


‘obtain 
need 


rn com- 
Jo 
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assignment by 


cardinality 


to complete by its due time 
feasible set. 


takes 


Bpaseon and Moore [4] have 
an O(nlogn) sequential 
obtains ae schedule 
number of tardy jobs. 

The problem of finding a schedule 
that minimizes the number of tardy jobs is 
equivalent to that of selecting a maximum 
subset U of J such that every 
job in U can be completed by its due time. 
Jobs not in U can be scheduled after those | 
in U and will be tardy. A set of jobs U 
such that every job in U can be scheduled 
is called a 
It is well known that a set 
of jobs U is feasible iff scheduling jobs 
in U in nondecreasing order of due times 


developed 
algorithm that 
that minimizes the 


results in no tardy jobs. 


When p,=l, 1l<ixn, a maximum cardinal- 
ity feasiblé set U can be obtained by con- 
nondecreasing order 
of due times. The job j currently being 
considered can be added to U iff |ul<d,. 
Procedure FEAS(J,b) is a slight generali- 
zation. It finds a maximum subset of J 
that can be scheduled in the interval 
[@,b]. DONE(i) is set to -l if job i is 
not selected and is set to a number 
greater than @ otherwise. If DONE(i) > @, 
then job i is to be scheduled from DONE (i) 
- 1 to DONE(i). The procedure’ itself 
returns a value that equals the number of 
jobs selected. The correctness of FEAS is 
easily established. using an exchange argu- 
ment. Its complexity is O(nlogn) as it 
this much time to order the jobs by 


_ due time. 
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Let J be a set of n unit processing 
time jobs. Let D(i), 1<ix<k be the dis- 
tinct due times of the jobs in J. Assume 
that D(i) < D(itl), l<i<k. Let n(i) be 


the number of jobs in J with due time 
line Procedure FEAS(J,n, b) 
select a maximum number of jobs 
for processing in [8,b] n=ld|// 
1 set J; integer n,b; global 
DONE (1 :n). 


2 sort J into nondecreasing order of 
due times 


3 DONE(1l:n) <— -l /finitialize// 
4 jj B- | 

5 fori <—1 to n do 

6 case 

7 :j>b: eetienea 

8 :}<d.: j<—jtl: DONE(i) — 3 
9 end Case 


19 end for 
1l return(j) 
12 end FEAS 


Figure 4.1 
D(i), l<i<k. Clearly, = n(i)=n.. Let 
D(@)=@ and n(@)=0. Define F(i) to be the 
value of j} when procedure FEAS (Figure 
4.1) has just finished sila ay ag all 
jobs in J with due time at most D; It is 


evident that: 


F(@) = D(@) 
(4.1) oo 
F(i)=min{F(i-1)+n(i),D(i),b}, l<ick 


Expanding the recurrence (4.1), we obtain: 


. m 
(4.2) F(m) = min{ min {D(i)+ 3 ee oo 


1<i<m q=itl 
The maximum number of jobs in a that 


can be scheduled in [@,b], b>@, so that 
none is tardy is F(k). F(k) may be effi- 


ciently computed, in parallel as follows. 
Let the due times of the n jobs in J be 
d(1), d(2),...,d(n). Let d(@)=0. We may 
assume that d(i)>@, l<i<n. The computa- 
tion steps are: 
Step 1: sort d(l:n) into nondecreasing 
order. 23 
Step 2: determine the points r(@), ..., 
r(k-1) in -d(@:n) where the due times 
change I.e. r(i) < r(itl),l<i<k and 
da(r(i)) # d(r(i)+l). Let r(k)=n. 


Clearly, r(@)=8, and n(i)=r(i)-r(i-1) and 


D(i) = a(r(i)), ls<ixk; 2(8)=a. 
Step 3: since D(i) + 3 n(q) = D(i)+n- 
r(i),we compute | P(R}4min{ nt | min {D(i)~ 
r(i)},b} £1] os 

With n? PEs, step 1 can be carried 
out in O(logn) time [5]. Using n-1 PEs, 


the boundary points can be found in 
time. PE(i) simply checks to see if 
d(i)<d(itl), l<i<n-1. If so, then iis a 
boundary point. ~@ and n are also pouneeEy. 


O(1) 
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explained in 


6. Dekel, E. and Sahni, S., 


points. 


The boundary points have now to 
be © moved into memory positions 
r(@),r(1),...,r(k). This can be done in 


O(logn) time using n PEs and the data con- 
centration algorithm of [7]. Another data 
concentration step moves d(r(@)), d(r(1)), 
-»-, A(r(k)) into D(@), D(1), D(k). 
Using k+l PEs, D(i)-r(i), 9<i<k can be 
computed in O(1) time. min{D(i)-r(i)} can 
be obtained in O(logk) time using the 
binary tree computation model of [3]. As 
[3], only O(k/logk) PEs are 
needed for this; but using k/2 PEs is fas- 


oeoesgsp 


ter). F(k) can now be computed using an 
additional O(1) time. The overall ,com- 
plexity is therefore O(logn) and n” PEs 
are used. The EPU gf the above algorithm 
is Q((nlogn/(logn*n*)) = Q{1/n). 

4. Conclusions 

The extent to which parallel computers 


will find application will depend largely 


on our ability to find efficient algo- 
rithms for. them. The reader is referred 
to [6] for further examples of efficient 


parallel algorithms. 
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OPTIMAL LOAD BALANCING STRATEGIES FOR 
A MULTIPLE PROCESSOR SYSTEM : 


| errr M. Nic 7 
_ Department of Computer Science 
Michigan State University 
East. Lansing, MI. 48824 ~ 


_ Abstract -- To balance the workload among mul- 
tiple processors is of fundamental importance in 


enhancing the performance of a multiple processor 


‘system (MPS). Optimal probabilistic load balanc- 
ing policies are studied in this paper. Multiple 
processor systems are classified into four catego- 

ries according to homogeneous versus heterogene- 

ous processors and single-job class versus multi- 
ple-job classes. Closed-form solutions are deriv- 
ed for scheduling an MPS with single job class. 

An optimal load balancing algorithm is developed 
for an MPS with multiple job classes. The probabi- 
listic scheduling policy is easy to be implement- 
ed in. an MPS and can be extended to optimize mess- 
age routing in a computer communications network. 


I. Introduction 


A loosely coupled Multiple Processor System 
(MPS) consists of multiple number of independent 


processors receiving jobs from a common job sche- 


duler [2,3]. Such MPSs are considered a kind of 
distributed computer systems. The motivation to 
develop MPS is to allow resources sharing. and to 
achieve higher system throughput and reliability. 
The objective of this study is to develop optimal 
load balancing techniques for achieving the above 
goals. The system performance of such an MPS is 

generally indicated by the average job turnaround 
time. | | | oe 

~ In an MPS, the job scheduler is responsible to 
dispatch jobs among several processors. An arriv- 
ing job is routed to one of the processors accord- 
ing to the scheduling policy and the job. characte- 
ristics. Load balancing can be done either deter- 
ministically or probabilistically. The determinis-— 
tic routing assigns the next processor depending 
on the current state of the system. The probabi- 

listic routing dispatches jobs in a proportionate 
approach, which is independent of the system 
state. 

Only probabilistic ‘routing strategies are con- 
sidered in this paper. The job scheduling. proba- 
bilities are solved for each job class to each of 
the processors with given work load and job assign- 
ment pattern. As a result, the minimal average 
job turnaround time can be: achieved. An optimal 
deterministic routing policy (if exist) should 
provide a better system performance than that pro- 
vided by an optimal probabilistic routing policy 
[6]. To prove the optimality of a specific. deter- 
ministic routing policy is a nontrivial task. 
‘Usually, a deterministic routing policy must be 
compared with other routing policies to display 
its superiority. Most performance evaluation 
under a deterministic routing policy is conducted 
on MPS with only two or three processors [2,3,5]. 
A probabilistic job routing policy is easier to 
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‘packet classes, 
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implement in MPSs with arbitrary number of pro- 
cessors. The scheduling overhead is low, because 
current processor information is not needed. The 
probabilistic approach can be also used to 
evaluate existing deterministic routing policies. 
Recently, Chow and Kohler [3] presented a queu- 
eing model to analyze a single-job-class and hete- 
rogeneous MPS. They proposed a proportional branc— 
hing policy, which assigns the job scheduling 
probability in proportional to the processing 
speed of the processor. The proportional branch- 
ing policy can prevent the queue from saturation, 
but cannot minimize the average job turnaround 
time. For the deterministic case, they presented 


an approximated numerical method to analyze a two- 
processor heterogeneous MPS. Towley studied the 
deterministic - routing ‘in a closed queueing net- 
work [9]. A 


single-server -processor~shar ing 
system with many job classes has been studied by 
[4]. Baskett, et al. studied the behavior of que- 
ueing networks with different classes of custom— 
ers [1]. 

Some related researches were conducted by 
[6,7] in packet switched computer communications 
networks. Computer network generally assumes fix- 
ed routing (probabilistic routing), since it is 
easy to describe by means of a routing table. 
Adaptive routing (deterministic routing), on the 
other hand, is complex to describe, and requires 
simulation to evaluate channel flows and delays. 
Furthermore, it was shown by [7] that at steady 
state, flow patterns and delays induced by good 
adaptive routing policies are very close to those 
obtained with optimal fixed routing policies. 


-Foschini - [6] studied deterministic routing polic- 


ies in a packet © switched » network with multiple 
where the outgoing trunks have 
different capabilities. He employed a diffusion 
analysis to study the effect of routing strateg- 


ies under a nearly overloaded situation. 


Optimal solutions to.the load balancing pro- 
blem are developed in this paper for a multiple 


‘processor system with single job class. The pro- 


portional branching policy suggested by Chow and 
Kohler [3] is formally proved to be nonoptimal. 

extends the MPS environment from 
single job class to multiple job classes. An 
optimal algorithm is developed to. calculate the 
optimal job scheduling probabilities for each 


| processor with multiple ‘job classes. A comparison 


of various load balancing policies: for .MPSs is 
also given. 


Il. System Classification and Scheduling Models 
A ‘homogeneous MPS contains identical “pro- 


cessors. Whereas, a heterogeneous. MPS contains 
different processors. Depending on the processor 


capability and assignability, jobs are classified 
into multiple classes. Different classes of jobs 
are to be assigned to different subsets of pro- 
cessors. In terms of processor capabilities and 
job classes, an MPS can be classified into one of 
the following four categories. 


SCHO: Single job Class HOmogeneous system. 
SCHE: Single job Class HEterogeneous system. — 
MCHO: Multiple job Classes HOmogeneous system. 


MCHE: Multiple job Classes HEterogeneous system. 


| 


Queues of jobs are formed at each processor 
based on the stochastic. nature of job arrival. and 
given job classification. The single most import- 
ant performance measure of an MPS is: the average 
job turnaround time. This includes the time from 
the submission of a job through the dispatcher to 
its completion by one of the processors. SCHE 
systems have been studied by Chow and Kohler [2]. 
A queueing model for an SCHE system is shown in 
Fig.la. Jobs from the same class are dispatched 
to the j-th processor with probability S;. The 
model can be generalized to consider multiple 
classes of job arrivals to the dispatcher as 
depicted in Fig.lb. This queueing network is used 
to model the scheduling environment of an MCHE 
system. With minor modification, it can be appli- 
ed to other three classes of multiple processor 
systems as well. 

Each processor in the MPS is modeled by an 
M/M/1 queue. Let n be the total number. of process- 
ors and m be the total number of distinct job 
classes. For n processors, we have n independent 
M/M/1 queues. The i-th job class has a Poisson 
arrival rate with mean A,. The j-th processor has 
an exponentially distributed service rate with 
mean ws, . Upon the arrival of a new job, the job 
dispatcher is responsible for assigning the job 
to one of the processors. The probabilistic 
scheduling policy is independent of the state of 
the system. The state of the system is represent-— 
ed by the number of jobs in each of. the queues at 
any instance. The first-come first-served (FCFS) 
queueing discipline is assumed, and jockeying is 
not allowed in this study. | ; 

Let M={1,2,...,m} and N={1,2,...,n}. be two 
sets representing. indices of job classes and pro- 
cessors respectively. Jobs in different classes 


arrive independently. The total job arrival rate 


» A, 1s the sum.of all different classes of job 
arrival rates, ;.. The job assignment matrix A 
=(ajj) is.an m by n matrix, where a,;, indicates 
that the i-th class job can be exetuted on the 
j-th processor; aij =0 otherwise. The job schedul- 
ing matrix S=(Sj; ) is an m by n matrix, where 
Si; is the probability of the i-th class job 
being assigned to the j-th processor. Obviously, 
S45, 70 if 4;,=0. After the job scheduling matrix 
is determined, the actual job arrival rate, i} P 


to the j-th processor. can. be expressed by 
 :4 t 
oat 


= £s8,.A 
ieM ia 
Since. each arrival source is a Poisson 


process, the linear combination of them is also a 

. : ° . e P 1 
Poisson process with mean arrival. rate Ay 
Hence, we have 


ing matrix resulting 
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A= EA, EA, (1) 
ieM jen 2 
Once the scheduling matrix S is determined, 


the model in Fig.lb can be decomposed into n inde- 
pendent M/M/l1 queues, where the j-th queue has 
mean arrival rate A; and service rate p, respec- 
tively. All queues behave independently but cons- 
trained by the linear relation in Eq.(1). 

An M/M/1 queue is solvable under the unsaturat- 
ed condition, as us . At equilibrium state, the 
average job turnaround time among jobs serviced 
by the j-th processor is calculated by T. 
l/(us, - Ns ) for all j. We want to find a particu- 
; gatisfying Eq.(1) to mini- 


lar assignment of A 
mize the average of all T.'s. Specifically, we 


define the average job turnazound time 


T= LT,(Ai / A) | (2) 
sSNA | | 

The problem of finding an optimal job schedul- 
in a minimal average job 
turnaround time can be formulated as a nonlinear 
programming problem as follows: 


Minimize T= = TQE/ 
jen a. 2 
provided that 
Mee E aeoe tl for ieM (3) 
- jen iJ . 
A < Eu, (4) 
jen J 
subject to 
s;,20 for ieM, jeN (5) 
Zs,,51 for ieM (6) 
jen J 
AL = A844 < Uy for jeN (7) 
J oaem 7 OS 
Condition in Eq.(3) prevents any one class of 
jobs from saturating the system. Condition in 
Eq.(4) ensures that the total job arrival rate is 
less than the total service rate. Constraint in 
Eq.(7) prevents any processor from saturation 


during the scheduling process. 


IIL. Optimal Load Balancing with 
Single Job Class 


In a single job class environment, the job 
assignment matrix A is a 1 by n row matrix with 
all. components equal to 1. Also m=l, }=),. We 
shall use S. to represent the probability s,.. 
The optimization problem stated in Sec.II can te 
simplified to 


| Minimize T = 5En $3 /(u5-A8 9) 4 re (8) 
"provided that - 
subject to- | 
| §, 20 for jen. nae 61s 
Se BA pe ee Fe a) 
8) <u for jen i 4 7 any 


To minimize the objective function T in Eq. 


(8), we employed the Method of Lagrange multi- — 


plier. Due to page limitations, proofs of all 
the following theorems are skipped. Interested 
readers may refer to [8] for details of all the 
proofs. 

* The objective function in Eq. (8) can be proved 
convex with respect to & for all j. In an SCHE 
system, processors may have different processing 
speeds. Obviously, a processor with higher servi- 
ce rate should have higher probability to be ass— 


_igned with jobs. Without loss of generality, the 


service rates {(U.} of the n PEOer es ores are denot- 
ed in 1 descending order : | 
w > Bo > soe. 2 7 > 0 = | a (13) 


Theorem pie . 
-In an SCHE system, “the job scheduling matrix 


s, which minimizes T and satisfies the cons- 
traints in Eqs.(10)-(12), has the following 
probabilities. i | 
Se [u i (O,-A)/B V/A for 1sjxk_ 
a a ie ee (14) 
a =O  — y for k<jsn 
where | . k 
i=l j=l 


and k is determined ‘by rhe job arrival rate i as 
follows: . 


6, -ViL By Py Ge a: ae TO 


k+l Trae for Isken (15) 


or | 
6-48, a Os for e (16) 


with this optimum assignment, we obtain 
T,=8,/ (YH, (8,-A)) for 1<j<k 


and the minimized average job turnaround time 


4 | 
BL k (6, -d) 

Te —_ : (17) 
(8, - A) | 
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s tated in Eq. a 7). 


- the one with maximum service rate. 


‘Note that As = “Ss +» This means. that the actual 
job. arrival. ‘rake to the j-th processor ‘is equal : 


‘to the service rate of the j-th processor subtrac- 


ting - a term which is proportional to the square 
root. of. the service. rate of the j-th processor. 
In an SCHO system (yon, for all 3), the job -sch- 
eduling matrix S has. equal probability S, = l/n. 
for all j. This means that -jobs are assigned 


randomly among. processors with equal probability 


as expected. 


In a light traffic eiivironment, ony! the Sick : 


k processors are assigned | with jobs. as stated in 
Theorem.1. 


The | average job turnaround time under 
this circumstance is faster than the service time 
of - any of the remaining n-k slower processors. 
This fact: is. proved in (8) by showing that ThA; 
for all j>k. 

Chow and Kohler. proposed a pro eitoaal anche . 


ing policy for an SCHE system (3]. The scheduling 


probabilities are proportional to the service 


rate of processors, but independent | of the job 


arrival rate, i.e., Sj = Hj / 8, for all j. We have 


‘discovered in [8] that this proportional branch- 


ing policy is not necessarily optimal. 


Most scheduling. studies on loosely coupled 


 MPSs were conducted in a single-class job environ-_ 
ment. In what follows, we compare our scheduling 


policy with two, known policies in a single-class 


job environment. 


(1) the pieportionel branching policy prongeen by 
Chow and Kohler. 
By Eq.(8), the average job turnaround time for 
the proportional branching | policy, Ty» can be 
expressed as” T,=n/(@ -A). | 


(2) the Sptiaal probabilistic. 
proposed by. Ni and Hwang. 
The average job turnaround time for the optim- 


al probabilistic scheduling policy, Tos was 


‘schedu Ling policy 


(3) the | deterministic scheduling peu proposed 

by Foschini. © 

A deterministic policy routes an arriving job 
to the processor that offers the least expected 
turnaround time. An arriving job is sent to the 
queue which has the minimum ratio of the queue > 
length to service rate. If minimum ratio is not 
unique, the job dispatcher selects from the ties 
A generalized 
version of this policy was studied by Foschini | 
{6]. This policy is considered the best schedul- 
ing, policy for an SCHE system [3]. 


(4) The ideal. scheduling with a ‘single fast _pro- 
cessor. | 
This corresponds to the case when a system has 
single processor whose service rate is the sum of 
service rates of all n individual processors in 
an MPS. This is an ideal case because the single 


processor has. the same capability of the whole 


MPS but the ill effect due to load unbalancing 
disappeared. This ideal case is included. for 
comparison purpose only. The single processor is 
a standard M/M/1 queue with service rate 8,. The 
average job turnaround time of such a system 
equals T, = 1/(@,7A). : 


_ There is no doubt that the deterministic sche- 
duling policy will result in the least turnaround 
time. However, closed-form solution of the 
average job ‘turnaround | time can not. be obtained 
for deterministic scheduling. One approach | to 
obtain a meaningful solution requires to perform 
extensive simulation experiments which are rather 
time-consuming. Chow and Kohler developed an 
efficient technique for analyzing deterministic 
scheduling in a two-processor: SCHE system. Their 
solutions are accurate only for a light traffic 
environment. When the job arrival rate approaches 
the total service rate, the accuracy begins to 
deteriorate. | ! 
Consider a two-processor SCHE system. 

v1 = 4 and yo = 1, we observed from Fig.2 that 


When 


T, 27,27, 2T, (18) 
for any choice of A. Note that when A is close 
to the total service rate, T, can not be obtain- 
ed due to the light traffic assumption made by 
Chow and Kohler. Under light load conditions, 
T and T, approach the performance of the 
single fast processor, because most of the jobs 
are assigned to the fast processor. When the 
arrival rate increases, the deterministic sche- 
duling policy displays its superiority over the 
probabilistic scheduling policy. The rapidly 
dec lined per formance of the proportional branch- 


ing policy is due to its failure condsdentng the 


effect of the arrival rate. 


IV. Environment of Multiple Job Classes 


The number of unknown scheduling probabilities 
for a multiple-job-classes environment equals the 
'mumber of nonzero elements in the assignment 
matrix A. In terms of the unknown scheduling 


probabilities, S535 the objective function ‘can 
be expressed as | | 
1 on om om 
ee 2 maps Ed48,4)] (19) 
jel i=l J isi * 


There are two obstacles which prevent a direct 
solution of Eq.(19). First, the objective func- 
tion T can not be proved to be convex. Secondly, 
even if T is convex, the method of Lagrange multi- 
plier cannot be used to simplify the problem, 
because at least m Lagrange multipliers are 
required. | | 7 

In our model, the average service rate of each 
processor is assumed. time-invariant. In other 
words, different classes of jobs assigned to the 
same processor have the same average service 
time. From the viewpoint of a processor, 
classes do not make any difference in achieving 
the average service rate once a job has been 
assigned to it. Let us temporarily ignore the job 
preference restriction, that is, each job can be 
assigned to any of the processors. The total. job 
arrival rate can be calculated by Eq.(1). | ‘The 
optimal job arrival rate assigned to each process- 
or can be derived directly from Theorem 1. Specif- 
ically, if §, denotes the optimal job assignment 
rate to the’ j-th processor, then east ‘S; for 


job 
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all j- This does suggest how an optimal assign- 
ment in an MPS with multiple job classes .can be 
achieved. If we distribute different job classes 
to multiple processors such that the actual job 
arrival rate,r} , equals the optimal job assign- 
ment rate,éd. , for each of the processors, the 
optimal job scheduling matrix can then be calcu- 
lated. More specifically, 


for jeN. (20) — 


Equation (20) is basically a set of n linear 


equations over more than n variables. This impli- — 


es that there may exit none, or one, or infinite- 
ly many solutions to Eq.(20) subject to the cons- 
traints given in Eqs.(5) to (7). The case of no 
solution must be avoided; whereas, the other two 
cases are acceptable in the search of an optimal 
scheduling matrix. The following example shows a 
singular case in which solution does not exist. 

Consider an MCHO system with two processors 
and two different job classes as illustrated in 
cae 3 with dA 4,42 = 2, and y» 10. Also only 
or) =Q for the job aug ighment matrix A. | 

From Theorem 1, we obtain S, = 0.5 ‘and s, 

0.5. Therefore, both 6, and 6, are equal to 3. 
Subaricatine these values into Eq.(20), we. obtain 


eal + 2s 
22 

Obviously, s,, = 1.5 > 1 violates the constraint. 
Therefore, the solution does not exist. In this 
example, the first job class must be assigned to 
the first processor.: The optimization problem 
becomes how to find the optimal scheduling proba- 
bilities, $5) and s provided that A, was 
assigned to § the. ‘fisse. processor (3,45 =1). In 
general, the problem of finding the optimal job 
assignment rate, 6. , with some Becees se eenr of 
jobs can be formulated as follows. 

Let c. be the preassigned job arrival rate 
to the jth processor. The preassigned rate may 
come from any of the job classes, but equally 
treated by the processor. Let n. be the arrival 
rate of the i-th job class, in which the assign- 
ment has not been determined. We shall refern; 
as the unassigned job arrival rate of the i-th 
job class. The preassigned and the unassigned job 
arrival rates are related by . 


= 3 
AG 


k 
n= In, =A-ZTe, on where YS. re,(21) 
ieM jeN j=1 
Let S. be the probability of jobs assigned 


to the jth processor over the total unassigned 
jobs with arrival rate jn . The problem of finding 
the optimal assignment rate to each processor, 
with some preassigned arrival rates, {c.}, and. 
a given total umassinged arrival rate,-n , is 
formulated as follows: ~ 


nS. + c.. 
J J 


A(p.-nS.-c. 
Cu, ns. re 


Minimize T = 
j 


tmp 


(22) 
: | 


_ provided that 


for ieM | (23) 


ho Dal, 
i. jen ij J 
h=enty.< 2u,260 . | (24) 
Tn jen a a. 3 
c. < H, “e jeN 25 
| J "5 : a 
- subject to 
S52 a) for jeN (26) 
Sed . ~~ «..%. ws OD 
jen ? » ¢ . 
| S, + c. <u,” for. jeN (28 
anes j ei j C28) 
It can be easily proved that the. objective 
function in £Eq.(22) is convex. Closed- form 


solution © for the above constrained minimization 
problem | is stated in Theorems 2 for heterogeneous 
multiple processor systems. Without loss of 
generality, we order the subscript j such that 


(29) 


“(uyre1) /Vty 2 (g-e2) /Vip 2 oe 2 (une) /n, 


Theorem 2: 
The optimal job assignment {6; } to a 
heterogeneous MPS with unassigned total arrival 
rate n and some preassigned rates c, for all j, 
which minimizes T in Eq.(22). “gubject. to the 
constraints in Eqs. (26)-(28), can be evaluated by 


Gyre) Ving Ho) /By. for lejck 


(30) 


ji fo -téecketien } 


whens k is determined” by the unassigned job 


arrival rate n as follows: 


a eng 


for l<k<n . 


(Gy, “yA red My, <n 5 


Pia ea” Sit) /¥y Wet 


or a - (31) 


| (8, Vq) “By gE) My < n as for k=n 


The physical. meaning of the optimal assignment 
in Eq.(30) can be interpreted as follows. If the 
traffic is very heavy, ise. A=ntyp approaches 
6 , the optimum assignment is very close to the 

n 
available capacity of the processor, Lees 
ec, for the j-th processor. When the traffic 
bécomes light, the optimal job ‘assignment is 
formed by subtracting a value proportional tov“. 
from the available capacity of that processor. I 

c.=0 for all j, Theorem 2 becomes equilivalent to 
r eorem 1. If W4=H for all j in Theorem 2, 
Theorem 2 can be applied to a homogeneous MPS. 
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matrix S_ for 


4. Form M"=M-M'. and N"=N-N'. 


Vv. An O timal Load Balancing = orithm 


& recursive | speieal.; load Getace ine migori the 
is developed below to generate. the scheduling 
an MPS in a multiple-job-class 
environment. Although there may be many solutions 
to S, the average job turnaround time is: unique 
and | minintzed for all possible solutions of S. 
Our purpose is to find a ‘systematic | procedure to 
generate at least one of the possible solutions 
of S. The. ‘notation. C. is used to denote the 
rate of the. i-th class Jjob assigned to the j-th 
processor. All aij ,Cij. Na, and Hy. (for I<isn, 
lgj<n) are global variables. M is a set of active 


job classes. ieM indicates that the assignment of 


the i-th: job class has not been determined, i.e., 
ni¥O . N is a set of active processors. jeN 
indicates that a, ,%0 for at least one keM. Both M 
and N sets are local variables. 


The Load Balancing Algorithm: 


Input: Global wariabled: aij» Cage Ny; a 


Local variables: M and N sets. 
Output: Changes on those global variables. 


Procedure: 


1. For each ieM with , ayy =], find a ‘particular. 
-k such that aj,=l. Then set cikeCiktni, ny+0 , 


and aj,+0 for that i. Update M and Ne 
2. For those jeN, calculate the correspond ing job 
assignment rates 8. for jeN (by applying 


Theorem 2, where n=Zny and Cy =Icy 


3. Form a set N', where jeN' if jeN Jand “Engels i5<84e 


Form a set M', where icM' if ieM and there is 
at least a jeN' such that ajj=l. If N'A, 
invoke this algorithm with inputs M' and N'. 

For each j eNn" with 

Lay: find a particular k such that. ans =] 

and 440. Set. C4 *CK jt, and nun 55 . 

5. If N’=0 in (3) or at“ least one particular k 
was found in (4), invoke this algorithm again 
with local inputs wu" and N". 

6. Update M and N. Solve the following set of 
. linear equations. 


for j €N 


ron, i378 
ieM 
EX... =1 for i eM 
Gen ij : | 
_ where ki. 8 are unknown variables satisfying 


X 70 if a,-0 5 X20 if a,,=1 for ieM, jen 

This set of 
infinitely many 
solution is sufficient. 


7. For j€M and jeN, set c, 14°C44 4K, 4n i and a, 5°. 


linear ‘equations always has 
solutions, Picking any one 


In the main program: 

1. Given m job classes with. average arrival | rates 
Mes , N processors with average service rate lll 
an mm by a job assignment _ matrix A, aad 
conditions in Eqs.(5)-(7). 

2. Initialize local variables M={1, 2,ee¢,m}>. and 

~N={1,2,...,n}. | 

3. Initialize global variables; e450 and nits 

for all ieM and jeN. _ 

Invoke the load balancing algorithm. | 


~ 


5. For 1gi¢m and 1¢j<n, calculate 855 =c../i.- 
VI. Conclusions 


Optimal probabilistic load balancing policies 
are developed for a multiple processor system 
with either single job class or multiple job 
classes. Those policies provide a test bed to 
determine the superiority of any deterministic 
scheduling policy over probabilistic ones. With 


the high implementation overhead of deterministic 


policy, we conclude that the proposed probabilist- 
ic scheduling policy is more feasible and can be 
systematically implemented in commercrel mUTEApte 
processor systems. , 
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arrivals ; departures 


(a) A queueing model for probaba Mietxe load 
balancing in an SCHE system. : 
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dispatcher 


(b) A queueing model for probabilistic load 
balancing in an MCHE. system. 


Fig. 1. Probabilistic load balancing models for 
a multiple processor system. 
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Fig. 2. Comparison of three load _ balancing 
policies for a two-processor hetero- 
geneous system with single job class. 
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Fig. 3. The queueing model of an MCHO system 
= § with two processors and two job classes. 
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; Summary 


Our research addresses the problem of task 
assignment in distributed multiprocessor 
systems. By distributed multiprocessor systems 
We mean any configuration of processors in which 
the cost of communication between processors is 
non-negligible. <A set of tasks to be assigned 
to the processors are referred to collectively 
as a distributed process. In order to achieve 
their common goal, each of the tasks performs 
two activities: execution on one of the 
processors, utilizing the local memory and 
resources of that processor, and communication 
with one or more other tasks in the distributed 
process (transmission of data and/or 
synchronization information). An assignment of 
tasks to processors designates one processor for 
each task to reside on for the lifetime of that 
task and is thus a static assignment. 


More precisely, we define a distributed 
process as a set of k tasks T={t,, to» 
In a multiprocessor system containing n 
processors P={p,, po, e+e, p_}; let x,. denote 
the execution ‘cost of task t, whe it is 
assigned to and hence executed on processor p.. 
Let cy denote the communication cost betweén 
two tatks t, and t, if they are assigned to 
different processtrs. | Throughout our 
discussion, we will assume that the 
communication cost between two tasks. executed on 
the same processor is negligible and _ that 
communication costs are independent of processor 


(as in a. fully-interconnected network of. 
processors). An assignment of tasks_ to 
processors can be formally described by a 


function from the set of tasks to the set of 
processors, £:T -> P, and an optimal assignment 


is one which minimizes some prespecified 
performance criterion. 
We consider two different performance 


criteria for optimal assignments of tasks to 
processors: minimization of the total sum of 
execution and communication costs and 
minimization of the execution and communication 
costs incurred by the processor with maximum 
cost. The latter is also referred to as 
minimization of latest finishing time because of 
the equivalence of this problem to certain 
deterministic scheduling problems. Asan 


* This work was partially suppor ted by the U.S. 
Office of Naval Research under Contract No. 
NO0014-79-C-0775 and by the U.S. Dept. of En- 
ergy under Contract No. DE-AC02-76ER02383. A003. 
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between each pair of task nodes t, and t: 


efficient . 


individually as Part I (leeretive): 


illustration of these per formance criteria, 


consider a system of 4 tasks and 3 processors. 


For the — assignment f(t,) = p,,_ £(t,) = Po» 
£(t4) = Po» and f(t,) = the total “sum of 
execution and communication costs is | 
Hyp PMyg thy oth, gtey ghey fey 4 tag tty, and the 
latest finishing time is max {f1,£2 £3} ieee 

EL = x) Fey Fey ghey yr 

£2 = Hy gthg ate ghey gttogtegy 

ES = Hy gtey gteg ghey 4° 


Task assignment to minimize the ‘total sum 
of execution and communication costs has been 
analyzed using a network flow model and network — 
flow algorithms by a number of researchers [1]- 
[2], [6]-{8]. A system of n processors and k 
tasks can be modeled as— a network by letting | 
each processor ‘be a distinguished node and each 
task be an ordinary node. An edge is drawn 
and is 
given the weight c, There is” an edge from 
each task node t, to bach processor node. Po with 


the weight 


an - QQ) 


p*q 


Ww 


iq ip 


An n-way cut is a set of edges which partitions 


the nodes of the network into n disjoint subsets 
wth exactly one processor node in each subset 

and thus corresponds naturally to an assignment 
of tasks to processors. The capacity of an n- 


way cut is the sum of the weights on the edges 


in the cut and is exactly equal to the sum of 
execution and communication costs incurred by 
the assignment because of the judicious choice 
of weights according to equation (1). 


- For 2 processor systems, known see ietene: 
Max Flow/Min Cut algorithms can be used to find 
an optimal assignment [7]. However, the problem 
of finding a minimum n-way cut for n > 2 is NP- 
complete and is thus unlikely to have any 
(polynomial-time). | solution. 
Therefore, we have devised the following group 
of efficient heuristic algorithms which together 
yield optimal or near optimal assignments for | 
tasks in a general n-processor system. 
Simulation results indicate their per formance to 
be very good. The. group of algorithms is 
described. collectively as Algorithm A and 
Part II 
a and Part III (Greedy). 


Part I is derived using the network model 
of the n-processor system described above and is 
designed based. on the following known result 
[7]: Consider a network G. obtained from the 
n-processor network by replacing the set of 
processor nodes P — {p,} with the node p, and by 
replacing edges from each task node to ‘the set 
of processor nodes P - {p,} with one new edge 
with weight equal to the stim of the weights on 
the replaced edges. The minimum cut in the 
network G, with p, and p, as distinguished nodes 
induces @ partition ofJ nodes in G, into two 
disjoint subsets, A, containing , and A 
containing p,e In atl optimal task assignment} 
tasks in A ake assigned to processor Py 


In each iteration of Part I, the Max Flow/ 
Min Cut Algorithm is applied for each processor 
node p,. and p, as distinguished nodes (as 
deacribdd: above) “co determine: the eubact. of 
tasks assigned to p,. The resultant assignment 
may be partial in tilat there may be tasks which 


remain unassigned. Let denote the set of 
tasks which remain. unassigned affer n 
iterations. We construct a network G from 


the network G used in the mth iteration by 
deleting from G™ all task nodes not in and 
by redefining the execution cost for t, in on 
processor p, as 

ae = a plus the sum of communication costs 

between t, and all tasks already assigned to 

processors other than p,. 
The weight on _ the edge! from t, to p 
recalculated according to Equation (1) 
these new values of execution cost for all tasks 


is 


in . The process of applying the Max 
Flow/Min Cut Algorithm in the network. G with 
p. and p, as distinguished nodes for each 
processor . is repeated. The iteration process 


halts when~either all tasks are assigned (in 
which case the assignment is optimal) or when no 
tasks are’ assigned in the last iteration. In 
the latter case, Part II of Algorithm is invoked 
on the subset of tasks not assigned by Part 


I. 

IL computes a lower bound L on the cost of 
an optimal mn-way cut when more than one 
processor is utilized for a reduced network 


containing the unassigned task nodes and the 
processor nodes:. 


c= § 


t,eT : 
1 


Ltr 


where c(p_,p,) is the cost of the minimum cut 

for some arbitrarily chosen processor Ps and 

processor p,. ; 
Based on this lower bound, the algorithm then 
checks to see if it would be cheaper to assign 
all remaining tasks to one processor. If so, 
the tasks in are all assigned to the one 
processor yielding minimum total execution cost 
for those tasks. In this case, the resultant 
assignment in combination with the assignment 
from Part I is optimal. Otherwise, Part III is 
invoked to complete the assignment. . 


Jwith 


359 


Part III, Algorithm Simple Greedy, locates 
clusters of tasks between which: communication 
costs are "large". Tasks in a cluster are then 


assigned. to the same processor, and _ the 
resultant assignment may be _ suboptimal. In 
particular, Simple Greedy computes C,~ the 


average communication cost over all pairs of 
tasks. Simple Greedy then deletes all edges for 
which cy < C. Each edge e = (t,,t,) for which 
c,, > Cds then examined. Let 6, “be the task 
etdster containing t, and G, be the task cluster 
containing t.. The ‘algorithm tests to see if 
there exists~a processor for which the total 
execution cost for all tasks in G. UG, is non- 
infinite. If so, the two clusters are merged 
into one large cluster, edges between tasks in 
the new cluster are deleted, and the process 
continues. When no more edges remain, each 
cluster is assigned to the processor with 
minimum total execution cost for the tasks in 
that cluster. : 


| In order to evaluate the performance of 
Algorithm A, simulation runs were made on data 


consisting of randomly generated task-processor 


configurations under the assumption that tasks 
tend to form clusters and that communication 
costs between tasks within a cluster are on the 
average larger than communication costs between 
tasks in different clusters. Task systems with 
6 to 20 tasks and 3 to 5 processors were 
simulated. The number and size of task 
clusters, the intra-cluster and inter-cluster 
communication costs, and the execution costs 
were all generated from uniform random 
distributions. In addition, simulations were 
performed in which the entire assignment was 
performed by Simple Greedy alone. The table 
below shows the distribution of the ratio of 
latest finishing time for a heuristic algorithm 
(Algorithm A or Simple Greedy) to the latest 
finishing time of an optimal algorithm. 


Optimal | 
=] < 11/10 .< 5/4 > 5/4 
Algorithm A 69% 25% 64 02% 
Simple Greedy 712 14% 112% 47 
We have also investigated the task 


assignment problem using: an: approach based on a 
classical model from deterministic scheduling 
theory [3]-[4]. We restrict our attention to 
the assignment of k independent tasks on n 
identical processors: taking into. account the 
over head: of. communication: between tasks assigned 
to different processors. Let P, T, and (c.,) be 
the set of processors,. set: of tasks, and matrix . 
of communication cost’s,, respectively, as before, 
and let let (x,) be a: vector of execution costs 
where x, is the cost of° executing task t, on 
each of the processors. The latest finishing . 
time (LFT) of all tasks in T for an assignment f 
is defined as | : : 


We = eice ” te x= ve Re c.]- 
ere eee 


The latest finishing time is thus the sum of 
execution and communication costs incurred by 
that processor for which execution costs plus 
communication costs is maximal. over ail 
processors. An optimal assignment fy is one 
for which latest finishing time is ottimal .. We 


note that an assignment which minimizes latest 


finishing time of all tasks in the set T also 
maximizes utilization of the processors - in the 
system. 


We restrict our attention to systems in 
which communication costs are a simple monotonic 
non-decreasing function of execution costs. 
This assumption can _ be justified when 
interprocess communication occurs primarily due 
to data exchange (e.g., when the tasks form a 
‘producer-consumer pair). It can also be 
justified in program behavior models in which 
each task corresponds to one of k disjoint 
program localities and communication costs incur 
only during transitions between localities. 


) we have the 


For the case Cay = a(x, ° x, 
following results: |. 
(1) Let X=. Ff x, and n is the number of 


t,eT | : 
processors. For +a y=, in an optimal schedule 
all tasks are assigned to one processor. 

(2) The problem of task assignment 
minimize LFT for task systems 
execution and communication costs taken into 
account is equivalent to the same peoptee with 
communication costs ignored. 

(3) The Longest Processing Time heuristic 


to 


(LPT) assigns tasks according to the following — 


rule: whenever a processor becomes available, 
the task with the greatest execution cost among 
those tasks not yet assigned is assigned to the 
free processor. We have the following tight 


bound: 
opr, 4 1 
mo oS oa 
OPT 


(4) If for all tasks t, in T, 


x,e{u, 2u, 4u, eee, aru, coe} 
for some constant u, ~ then the assignment 
- produced by LPT is optimal. 
For the case Ci = a(x, + x5) we have the 
following results: — | 
(l) Let X= 2 Ky and k is the number of 


t,eT 


tasks and nis the number of processors. For 


a > (n-1)/(k+(n-2)), in an optimal schedule all 


tasks are assigned to one processor. 


(2) For the LPT heuristic we have the 
following loose bound: _LPT < 2. 
. OPT 


The proofs of these results can be found in [5]. 


with both . 
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[3] 


Our results indicate that useful 
suboptimal algorithms for the task assignment 
problem exist in both the case where the goal is 
to minimize total execution and communication 


“costs and the case where the goal is to minimize 


latest finishing time. The former goal takes a . 


‘global view of the system and aims to minimize 
‘total resource usage. 


The latter goal treats 
concurrency as the main factor in working for 
optimality with respect to resource usage. Both 
approaches represent impor tant concerns for task 
assignment algorithms. 
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